[00:20:16] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:21:32] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.305 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:38:46] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/954373
[00:38:52] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/954373 (owner: TrainBranchBot)
[00:54:22] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/954373 (owner: TrainBranchBot)
[01:04:23] <wikibugs>	 SRE-swift-storage, Commons, Internet-Archive: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (AntiCompositeNumber)
[01:50:22] <icinga-wm>	 RECOVERY - snapshot of s6 in eqiad on backupmon1001 is OK: Last snapshot for s6 at eqiad (db1225) taken on 2023-09-04 01:06:36 (505 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:08:58] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:58] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:49:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:12] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:51:46] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:56:12] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 8.918 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:00:38] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:01:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:54] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:05:46] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.336 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:10:20] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:12:22] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:12:36] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:06] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:38] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.220 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:17:42] <wikibugs>	 (PS2) KartikMistry: Update MinT to 2023-08-31-061147-production [deployment-charts] - https://gerrit.wikimedia.org/r/954005 (https://phabricator.wikimedia.org/T336683)
[05:19:15] <wikibugs>	 (PS2) KartikMistry: Enable Section and Content Translation in 7 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/953756 (https://phabricator.wikimedia.org/T343211)
[05:48:22] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:49:20] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:54:12] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 5.935 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:55:06] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:58:48] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:59:50] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:32] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:09:02] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:11:26] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.908 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:12:10] <XioNoX>	 !log push new pfw policies - T345288
[06:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:46] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:17:50] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:19:14] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:20:00] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.015 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:20:30] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 4.359 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:25:58] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:26:30] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:26:36] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:27:50] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 3.332 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:27:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:28:46] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.359 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:33:58] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:43:18] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:43:50] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:45:22] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:48:18] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:49:04] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.366 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:52:42] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:53:36] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:54:56] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.316 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:55:28] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:56:28] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:56:54] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 3.912 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:00:06] <jouncebot>	 Amir1, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T0700).
[07:00:06] <jouncebot>	 Aca, kart_, and aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:15] * Aca waves
[07:00:19] * kart_ is here
[07:00:50] <aanzx>	 * o/
[07:00:56] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:01:26] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:01:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:01:28] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:01:29] <imdeni>	 Huh, that seems like a good incentive to break a wiki. /s
[07:02:41] <kart_>	 Please ping me when Aca's patches are deployed.
[07:02:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:03:42] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.295 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:04:14] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 2.054 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:08:27] <wikibugs>	 (CR) Deni: [C: +1] "Approved by sh.wiki community." [mediawiki-config] - https://gerrit.wikimedia.org/r/954240 (https://phabricator.wikimedia.org/T345513) (owner: Acamicamacaraca)
[07:09:32] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:13:40] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:14:05] <wikibugs>	 (CR) Ayounsi: [C: +2] infra_devices: remove parents for multihomed devices [puppet] - https://gerrit.wikimedia.org/r/954278 (https://phabricator.wikimedia.org/T329272) (owner: Ayounsi)
[07:16:54] <kart_>	 Anyone deploying config patches? Amir1 taavi urbanecm ?
[07:17:14] <Aca>	 umm seems like they are not here :O
[07:17:17] <imdeni>	 Good question indeed.
[07:18:17] <kart_>	 :/
[07:19:02] <Aca>	 It's Monday, rightfully so :')
[07:19:07] <kart_>	 I can deploy but I have limited time today. If no one is around, please reschedule your patches for the next backport/config window.
[07:19:19] <kart_>	 Aca: Agree :)
[07:21:18] <Aca>	 Okiee. I'll be around, so if anyone else wants to deploy, please ping me. Otherwise, I'll reschedule my patches.
[07:22:49] <moritzm>	 !log failover ganeti masters in drmrs to ganeti6001/ganeti6002
[07:22:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:46] <kart_>	 Let me deploy at least my patch :)
[07:26:26] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[07:26:44] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[07:28:44] <imdeni>	 kart_: Is it a regular occurrence that deployers are not showing up?
[07:29:05] <kart_>	 imdeni: not really. 
[07:29:49] <kart_>	 (I have an issue logging in to the deployment server), so I also have to withdraw my patch.
[07:29:58] <imdeni>	 Hmm. I just find it odd that three people have signed up and no-one is here.
[07:31:58] <imdeni>	 kart_: Do you know who is responsible for putting together the schedule?
[07:32:56] <elukey>	 logmsgbot
[07:33:01] <elukey>	 err
[07:34:21] <kart_>	 imdeni: Ah. Today is 'No deploy' day (US holiday)
[07:34:52] <imdeni>	 kart_: Oh, Labor Day. Where do you see this?
[07:35:35] <Emperor>	 !log restart tcpircbot-logmsgbot on alert1001
[07:35:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:19] <imdeni>	 I see it. https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar
[07:37:13] <Aca>	 Damn, didn't know that
[07:37:33] <Aca>	 rescheduling
[07:39:13] <imdeni>	 This should really be fixed, I think between us we are in 3-4 different timezones.
[07:39:54] <imdeni>	 Tagging @thcipriani
[07:40:35] <imdeni>	 It would be great if you could take a look at changing the bot to account for this when you get a chance.
[07:43:02] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove obsolete OS conditionals [puppet] - https://gerrit.wikimedia.org/r/954279 (owner: Muehlenhoff)
[07:45:27] <wikibugs>	 (CR) Muehlenhoff: [C: +2] prometheus mysqld_exporters: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/954220 (owner: Muehlenhoff)
[07:46:54] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Simplify IPMI check [puppet] - https://gerrit.wikimedia.org/r/954257 (owner: Muehlenhoff)
[07:47:01] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:47:13] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:47:43] <wikibugs>	 (PS1) Elukey: Add new OAuth Rate Limiter tier for Wiki Education [mediawiki-config] - https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394)
[07:47:47] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:20] <wikibugs>	 (PS2) Elukey: Add new OAuth Rate Limiter tier for Wiki Education [mediawiki-config] - https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394)
[07:50:31] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:50:51] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 6.850 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:51:02] <wikibugs>	 (PS3) Elukey: Add new OAuth Rate Limiter tier for Wiki Education [mediawiki-config] - https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394)
[07:51:54] <wikibugs>	 (CR) JMeybohm: [C: +1] etcd: Remove obsolete file [puppet] - https://gerrit.wikimedia.org/r/954276 (owner: Muehlenhoff)
[07:52:14] <wikibugs>	 (PS1) Muehlenhoff: mariadb::packages_client: Remove obsolete OS check [puppet] - https://gerrit.wikimedia.org/r/954594
[07:53:26] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/954594 (owner: Muehlenhoff)
[07:54:31] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:54:47] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:55:00] <wikibugs>	 (Abandoned) JMeybohm: confd: Move -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341669) (owner: JMeybohm)
[07:55:41] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:56:01] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: re-add space to irc messages [puppet] - https://gerrit.wikimedia.org/r/954355 (owner: Majavah)
[07:56:01] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 3.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:56:47] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.509 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:23] <wikibugs>	 (CR) Muehlenhoff: [C: +2] etcd: Remove obsolete file [puppet] - https://gerrit.wikimedia.org/r/954276 (owner: Muehlenhoff)
[07:57:34] <wikibugs>	 (PS2) Muehlenhoff: etcd: Remove obsolete file [puppet] - https://gerrit.wikimedia.org/r/954276
[07:59:08] <wikibugs>	 (PS2) Muehlenhoff: lxc: Remove obsolete files [puppet] - https://gerrit.wikimedia.org/r/954274
[07:59:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter2004.codfw.wmnet
[07:59:45] <wikibugs>	 (CR) JMeybohm: [C: +1] "Thanks for doing this!" [puppet] - https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: Jbond)
[08:00:17] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter2004.codfw.wmnet
[08:00:26] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:00:26] <wikibugs>	 (CR) Muehlenhoff: [C: +2] lxc: Remove obsolete files [puppet] - https://gerrit.wikimedia.org/r/954274 (owner: Muehlenhoff)
[08:00:57] <elukey>	 !log restart kubelet on ml-serve1002 to check if stale prometheus metrics are the cause of the stop_container alert
[08:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:12] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet
[08:03:33] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1001.eqiad.wmnet
[08:04:30] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.423 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:08:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast4004.wikimedia.org
[08:09:12] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw
[08:10:06] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host chartmuseum2001.codfw.wmnet
[08:10:24] <wikibugs>	 (PS1) Muehlenhoff: bastion: Update canary in Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/954595
[08:10:42] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster1001.eqiad.wmnet
[08:11:20] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1002.eqiad.wmnet
[08:13:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:13:23] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode1001.eqiad.wmnet
[08:14:02] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum2001.codfw.wmnet
[08:14:25] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw
[08:14:34] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad
[08:14:48] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host chartmuseum1001.eqiad.wmnet
[08:14:54] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter2003.codfw.wmnet
[08:15:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] bastion: Update canary in Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/954595 (owner: Muehlenhoff)
[08:15:43] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet
[08:15:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast4004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:15:54] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2002.codfw.wmnet
[08:17:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast4004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:17:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:17:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast4004.wikimedia.org
[08:17:49] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet
[08:17:49] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast4004.wikimedia.org` - bast4004.wikimedia.org (**PASS**)   - Downtimed host on Icinga/Alertmanager   - F...
[08:18:09] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode2001.codfw.wmnet
[08:18:24] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:18:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast5003.wikimedia.org
[08:18:44] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum1001.eqiad.wmnet
[08:19:05] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad
[08:19:30] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:22:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:23:08] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:23:33] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubestagemaster1002.eqiad.wmnet
[08:23:59] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode2001.codfw.wmnet
[08:24:28] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubestagemaster1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:25:11] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw
[08:27:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:28:22] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubestagemaster2002.codfw.wmnet
[08:30:46] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:56] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubestagemaster1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:31:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:31:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:31:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast5003.wikimedia.org
[08:31:18] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast5003.wikimedia.org` - bast5003.wikimedia.org (**PASS**)   - Downtimed host on Icinga/Alertmanager   - F...
[08:31:32] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad
[08:31:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:33:40] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry1003.eqiad.wmnet
[08:33:44] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-eqiad
[08:34:50] <elukey>	 !log rename "ens5" to "ens13" on orespoolcounter2003's /etc/network/interfaces after a VM reboot
[08:34:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:55] <elukey>	 lovely --^
[08:35:35] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudlb: haproxy: mysql: expose tcp port to cloud-private networks only [puppet] - https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) (owner: Arturo Borrero Gonzalez)
[08:36:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast6002.wikimedia.org
[08:37:01] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad
[08:38:05] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1003.eqiad.wmnet
[08:38:29] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry1004.eqiad.wmnet
[08:39:09] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-eqiad
[08:41:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:41:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter2003.codfw.wmnet
[08:41:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:42:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10aborrero) We are already working at service level with this box. We should coordinate reimage/reboots etc.
[08:43:32] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1004.eqiad.wmnet
[08:44:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast6002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:45:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast6002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:45:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:45:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast6002.wikimedia.org
[08:45:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast6002.wikimedia.org` - bast6002.wikimedia.org (**PASS**)   - Downtimed host on Icinga/Alertmanager   - F...
[08:45:56] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw
[08:46:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:46:49] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry2003.codfw.wmnet
[08:46:57] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter1004.eqiad.wmnet
[08:49:56] <wikibugs>	 10SRE-swift-storage, 10Commons: File not found on commons - https://phabricator.wikimedia.org/T345522 (10Aklapper) [Unrelated to MediaWiki software code but about file storage on Wikimedia server and thumbnails]
[08:50:16] <wikibugs>	 (03CR) 10Ladsgroup: "Generally looks fine, I think we need to keep ServiceOps in the loop." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394) (owner: 10Elukey)
[08:51:28] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2003.codfw.wmnet
[08:51:46] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet
[08:53:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove bast4004/bast5003/bast6002 [puppet] - 10https://gerrit.wikimedia.org/r/954597 (https://phabricator.wikimedia.org/T343515)
[08:54:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10MoritzMuehlenhoff)
[08:56:48] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet
[08:56:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove bast4004/bast5003/bast6002 [puppet] - 10https://gerrit.wikimedia.org/r/954597 (https://phabricator.wikimedia.org/T343515) (owner: 10Muehlenhoff)
[08:57:25] <elukey>	 !log rename "ens5" to "ens13" on orespoolcounter1004's /etc/network/interfaces after a VM reboot
[08:57:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:21] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter1004.eqiad.wmnet
[09:04:30] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter1003.eqiad.wmnet
[09:07:11] <wikibugs>	 (03CR) 10Elukey: Add new OAuth Rate Limiter tier for Wiki Education (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394) (owner: 10Elukey)
[09:09:11] <elukey>	 !log rename "ens5" to "ens13" on orespoolcounter1003's /etc/network/interfaces after a VM reboot
[09:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:49] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter1003.eqiad.wmnet
[09:13:48] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad
[09:14:41] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1129.eqiad.wmnet with OS bullseye
[09:17:13] <wikibugs>	 (03PS1) 10JMeybohm: AQS2: Lower replicas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954600
[09:19:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot)
[09:21:03] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] AQS2: Lower replicas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954600 (owner: 10JMeybohm)
[09:22:04] <wikibugs>	 (03Merged) 10jenkins-bot: AQS2: Lower replicas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954600 (owner: 10JMeybohm)
[09:25:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Removing +2, digging a bit more into the history chain and currently deployed version, this is apparently already done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot)
[09:27:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply
[09:27:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply
[09:27:55] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1129.eqiad.wmnet with reason: host reimage
[09:28:34] <akosiaris>	 !log deploying mathoid to bump service mesh envoy version to 1.23.10-2-s2. No changes to the app.
[09:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:07] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply
[09:29:07] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1129.eqiad.wmnet with reason: host reimage
[09:29:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply
[09:30:04] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply
[09:30:20] <wikibugs>	 10ops-codfw, 10Data-Platform-SRE: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Vgutierrez)
[09:32:50] <godog>	 mmhh datahub-mae-consumer on kubestage is spamming logstash hard, can we do sth about it ?
[09:33:04] <godog>	 https://logstash.wikimedia.org/goto/ec41fa68ba494813fa14be88c941b807
[09:33:18] <godog>	 akosiaris jayme ^ ?
[09:33:35] <jayme>	 if that's new then it's probably related to me rebooting the cluster nodes
[09:33:46] <jayme>	 else cc btullis
[09:33:55] <godog>	 I'd say so too, seems to have started with the reboots
[09:34:04] <godog>	 around 8:38
[09:34:04] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad
[09:34:26] <btullis>	 Looking now. Also stevemunene will probably want to know.
[09:34:29] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply
[09:34:31] <jayme>	 wow...really quite noisy
[09:34:44] <godog>	 yeah, maxing out logstash :(
[09:35:00] <btullis>	 We can kill the pod. It's not ingesting anything.
[09:35:09] <jayme>	 should probably be fixed in some way still
[09:35:54] <jayme>	 if killing it/some other datahub component leads to this, that is problematic
[09:36:18] <btullis>	 Agreed. I was only speaking about short term management.
[09:36:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "I am removing my -1 and switching to +2 per comment. I am also removing physikerwelt's -1 to let CI proceed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 (owner: 10PipelineBot)
[09:36:54] <godog>	 +1 to kill the pod as a mitigation, also +1 to decode the stack trace to see what's wrong
[09:37:31] <wikibugs>	 (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 (owner: 10PipelineBot)
[09:37:52] <godog>	 jayme: what's the recommended way to stop datahub-mae-consumer in this case ?
[09:37:55] <jayme>	 ack. I've killed the pod
[09:38:15] <jayme>	 godog: no idea. I did "kubectl -n datahub delete po datahub-mae-consumer-main-ff6cb7484-pjsqd"
[09:38:25] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Add CP secret synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[09:39:04] <godog>	 jayme: ack, I'll check if the new pod still spams, I'd imagine it does though
[09:39:16] <btullis>	 Confused: I was running `btullis@deploy1002:~$ kubectl logs -f datahub-mae-consumer-main-ff6cb7484-pjsqd datahub-mae-consumer-main` and I didn't see a lot of logs. The latest I saw was at 09:25. Will check logstash.
[09:39:43] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[09:40:22] <godog>	 btullis: see also the link I posted above in case you missed
[09:40:28] <godog>	 "you missed it" even
[09:40:34] <btullis>	 Thanks. Just found and clicked.
[09:40:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merging to not break the relation chain (and it's a bit easier than the manual rebase the dependent commit requires)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/906694 (owner: 10PipelineBot)
[09:40:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/906694 (owner: 10PipelineBot)
[09:41:11] <jayme>	 btullis: would you take care of this? That would be nice
[09:41:36] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/906694 (owner: 10PipelineBot)
[09:41:43] <btullis>	 jayme: Yes I will.
[09:41:45] <jayme>	 reboots are done, so there should be no out-of-band po deletions
[09:41:48] <jayme>	 cool, thanks!
[09:42:04] <godog>	 thank you jayme btullis ! appreciate it
[09:42:08] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[09:42:50] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[09:42:54] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1129.eqiad.wmnet with OS bullseye
[09:43:01] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply
[09:43:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:43:46] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[09:44:15] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot)
[09:44:30] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] taskgen: update for tox 4 syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954297 (https://phabricator.wikimedia.org/T345152) (owner: 10Majavah)
[09:44:45] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[09:45:24] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: Add CP secret (duration: 15m 47s)
[09:45:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot)
[09:45:54] <jbond>	 then /go jayme 
[09:46:24] <wikibugs>	 (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot)
[09:47:33] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1006 - aborrero@cumin1001"
[09:47:45] <jbond>	 !log disable-puppet fleet wide  "deploy confd change gerrit:954007"
[09:47:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:02] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply
[09:48:18] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply
[09:48:25] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1006 - aborrero@cumin1001"
[09:48:25] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:48:34] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[09:48:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:49:00] <akosiaris>	 !log T345290. Update mathoid to 2023-05-13-192519-production
[09:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:02] <stashbot>	 T345290: Deploy a more recent version of Mathoid to production than 2023-02-21 - https://phabricator.wikimedia.org/T345290
[09:49:07] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[09:49:20] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply
[09:49:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[09:50:00] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[09:51:26] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: drop -next suffix from ns-recursor [puppet] - 10https://gerrit.wikimedia.org/r/954605 (https://phabricator.wikimedia.org/T342621)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T1000)
[10:01:38] <wikibugs>	 10SRE-swift-storage, 10Commons: File not found on commons - https://phabricator.wikimedia.org/T345522 (10MatthewVernon) @Shizhao the file looks OK to me, what's the problem with this image, please?
[10:03:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: drop -next suffix from ns-recursor [puppet] - 10https://gerrit.wikimedia.org/r/954605 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez)
[10:04:42] <wikibugs>	 (03PS1) 10Elukey: ml-services: tune autoscaling for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/954608 (https://phabricator.wikimedia.org/T344058)
[10:05:31] <wikibugs>	 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10Vgutierrez) @ppenloglou please let us know if you need help submitting a new SSH key for the production environment. Otherwise we will close this task
[10:05:43] <wikibugs>	 10Puppet, 10SRE: run-puppet-agent --quiet fails - https://phabricator.wikimedia.org/T345548 (10Volans) p:05Triage→03High
[10:06:20] <wikibugs>	 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10Vgutierrez) p:05Triage→03Medium
[10:09:30] <wikibugs>	 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10ppenloglou) Dear @Ladsgroup,  Thanks for letting me know about the misuse of my ssh key. Could you guide me in the right direction for the following? Currently, I would like to be able to:...
[10:11:47] <wikibugs>	 (03PS1) 10Volans: run-puppet-agent: fails with --quiet [puppet] - 10https://gerrit.wikimedia.org/r/954609 (https://phabricator.wikimedia.org/T345548)
[10:12:02] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10BTullis) 05Open→03Resolved Apologies again for the delay @OSefu-WMF - As mentioned, I'll carry on investigating the missin...
[10:12:43] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Volans) >>! In T342534#9137952, @Papaul wrote: > @jbond @Volans on 2027 - 2029 > puppet is failing with  > ` > ----- OUTPUT of 'run-puppet-agent --quiet' -----...
[10:13:01] <wikibugs>	 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10Vgutierrez) @ppenloglou that's right. as stated in https://wikitech.wikimedia.org/wiki/People.wikimedia.org people.wm.o is part of the production environment and the SSH key can't be shar...
[10:13:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet
[10:13:28] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: run-puppet-agent --quiet fails - https://phabricator.wikimedia.org/T345548 (10Volans)
[10:14:55] <wikibugs>	 (03PS1) 10Jbond: confd: Only notify the current instance [puppet] - 10https://gerrit.wikimedia.org/r/954610 (https://phabricator.wikimedia.org/T341669)
[10:16:51] <wikibugs>	 (03PS2) 10Majavah: openstack: Remove a bunch of Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/954102 (https://phabricator.wikimedia.org/T345294)
[10:18:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm Doh!" [puppet] - 10https://gerrit.wikimedia.org/r/954609 (https://phabricator.wikimedia.org/T345548) (owner: 10Volans)
[10:18:52] <wikibugs>	 (03CR) 10Volans: [C: 03+2] run-puppet-agent: fails with --quiet [puppet] - 10https://gerrit.wikimedia.org/r/954609 (https://phabricator.wikimedia.org/T345548) (owner: 10Volans)
[10:19:08] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] ml-services: tune autoscaling for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/954608 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey)
[10:19:11] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] openstack: Remove a bunch of Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/954102 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[10:19:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/954609 (https://phabricator.wikimedia.org/T345548) (owner: 10Volans)
[10:20:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: tune autoscaling for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/954608 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey)
[10:20:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for cjming - https://phabricator.wikimedia.org/T345455 (10Vgutierrez) 05Open→03Stalled p:05Triage→03Medium a:03Vgutierrez Waiting for OOB validation
[10:20:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6 NOOP 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43128/console" [puppet] - 10https://gerrit.wikimedia.org/r/954610 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[10:20:58] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] confd: Only notify the current instance [puppet] - 10https://gerrit.wikimedia.org/r/954610 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[10:21:14] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] P:wmcs::kubeadm: remove version defaults [puppet] - 10https://gerrit.wikimedia.org/r/953577 (owner: 10Majavah)
[10:21:19] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] confd: Only notify the current instance [puppet] - 10https://gerrit.wikimedia.org/r/954610 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[10:22:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet
[10:24:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet
[10:25:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6003.drmrs.wmnet
[10:29:05] <jbond>	 !log enable-puppet fleet wide post  "deploy confd change gerrit:954007"
[10:29:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix use of more than one src/dst sets [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497)
[10:31:13] <wikibugs>	 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10ppenloglou) Thank you @Vgutierrez for your reply, now it makes sense.  I've created a new SSH key locally saved as "id_ed25519_wmprod.pub" so I can tell them apart.  And it is: **ssh-ed25...
[10:31:40] <wikibugs>	 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila)
[10:33:58] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:34:11] <wikibugs>	 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTBase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos)
[10:34:21] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[10:34:24] <wikibugs>	 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTBase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) 05Open→03Resolved
[10:39:37] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Internet-Archive: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10MatthewVernon) It's a little difficult to see what might have happened here, since you've overwritten the "original" path. The on...
[10:40:43] <wikibugs>	 (03Abandoned) 10Majavah: hieradata: drop ldap-labtest acme-chier cert [puppet] - 10https://gerrit.wikimedia.org/r/885026 (owner: 10Majavah)
[10:45:13] <wikibugs>	 (03Abandoned) 10Hnowlan: Add discovery records for device-analytics [dns] - 10https://gerrit.wikimedia.org/r/917306 (https://phabricator.wikimedia.org/T335505) (owner: 10Hnowlan)
[10:45:42] <wikibugs>	 (03PS1) 10JMeybohm: site.pp: Split wikikube workers per DC [puppet] - 10https://gerrit.wikimedia.org/r/954615 (https://phabricator.wikimedia.org/T342534)
[10:46:39] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394) (owner: 10Elukey)
[10:49:07] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] site.pp: Split wikikube workers per DC [puppet] - 10https://gerrit.wikimedia.org/r/954615 (https://phabricator.wikimedia.org/T342534) (owner: 10JMeybohm)
[10:51:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet
[10:52:01] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10Aklapper) @lojo_wmde: Could you please answer the last comment? Thanks in advance!
[10:59:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet
[11:01:59] <wikibugs>	 (03CR) 10Jbond: Fix use of more than one src/dst sets (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:02:52] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] site.pp: Split wikikube workers per DC [puppet] - 10https://gerrit.wikimedia.org/r/954615 (https://phabricator.wikimedia.org/T342534) (owner: 10JMeybohm)
[11:03:43] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10JMeybohm) >>! In T342534#9137952, @Papaul wrote: > [...] > so 2025 and 2026 nodes had 2 roles, insetup and kubernetes::worker roles    That was overlook...
[11:05:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet
[11:05:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet
[11:08:05] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: " - jbond@cumin1001 - T342534"
[11:08:08] <stashbot>	 T342534: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534
[11:08:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove LDAP access for vhargyono [puppet] - 10https://gerrit.wikimedia.org/r/954618
[11:08:56] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: " - jbond@cumin1001 - T342534"
[11:11:07] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10jbond) FYI, I just checked kubernetes2025 (via install-console), kubernetes2027 and kubernetes2029, and puppet seems to be running well now
[11:14:13] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) I've put a very brief summary of using the cookbook on Wikitech here:  https://wikitech.wikimedia.org/wiki/ZTP_Ne...
[11:15:33] <wikibugs>	 (03PS1) 10Urbanecm: beta: Do not reference image-suggestion-api.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556)
[11:15:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811 (10jbond)
[11:15:56] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructure - https://phabricator.wikimedia.org/T341497 (10jbond) 05Open→03Resolved a:03jbond this has been completed
[11:16:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for vhargyono [puppet] - 10https://gerrit.wikimedia.org/r/954618 (owner: 10Muehlenhoff)
[11:16:47] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond)
[11:17:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[11:17:28] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond) 05In progress→03Resolved a:03jbond This is now in place
[11:19:08] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214)
[11:19:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[11:21:56] <wikibugs>	 (03PS2) 10Jbond: puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214)
[11:21:58] <wikibugs>	 (03PS1) 10Jbond: puppetdb-api: switch dev sevices back to puppetdb-api [puppet] - 10https://gerrit.wikimedia.org/r/954647 (https://phabricator.wikimedia.org/T342214)
[11:22:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[11:23:07] <wikibugs>	 (03PS3) 10Jbond: puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214)
[11:24:23] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43130/console" [puppet] - 10https://gerrit.wikimedia.org/r/954647 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[11:24:35] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43129/console" [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[11:26:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:27:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:27:52] <wikibugs>	 (03PS4) 10Jbond: puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214)
[11:29:38] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43131/console" [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[11:30:38] <wikibugs>	 (03PS2) 10Muehlenhoff: Fix use of more than one src/dst sets [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497)
[11:31:52] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:35:41] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1002.eqiad.wmnet with OS bullseye
[11:36:55] <wikibugs>	 10Puppet, 10SRE: run-puppet-agent --quiet fails - https://phabricator.wikimedia.org/T345548 (10Volans) 05Open→03Resolved Change has been merged and by now deployed everywhere. Resolving.
[11:37:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Awesome!" [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[11:37:39] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:38:07] <wikibugs>	 (03PS1) 10Jbond: puppetmasters: switch to HTTPSUrl [puppet] - 10https://gerrit.wikimedia.org/r/954652 (https://phabricator.wikimedia.org/T338811)
[11:38:35] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@26bc1a5]: Add new wikis T343543 T343549 T345171
[11:38:40] <stashbot>	 T343549: Add suwikisource to RESTBase - https://phabricator.wikimedia.org/T343549
[11:38:41] <stashbot>	 T343543: Add blkwiktionary to RESTBase - https://phabricator.wikimedia.org/T343543
[11:38:41] <stashbot>	 T345171: Add tlywiki to RESTBase - https://phabricator.wikimedia.org/T345171
[11:42:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip
[11:42:26] <icinga-wm>	 /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res
[11:42:26] <icinga-wm>	 s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.31:7231/en.wikipedia.org/v1/media/math/check/tex: Ti https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:44:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/
[11:44:12] <icinga-wm>	 Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.100:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.0.100:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:44:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip
[11:44:32] <icinga-wm>	 /v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.113:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://
[11:44:32] <icinga-wm>	 .113:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:44:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedi
[11:44:44] <icinga-wm>	 /page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.125:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.
[11:44:44] <icinga-wm>	 5:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:45:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikip
[11:45:22] <icinga-wm>	 /v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/re
[11:45:26] <wikibugs>	 (03PS2) 10Btullis: Update Presto TLS configuration in production [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642)
[11:46:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikip
[11:46:10] <icinga-wm>	 /v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.179:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while down
[11:46:10] <icinga-wm>	 http://10.64.48.179:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:46:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip
[11:46:24] <icinga-wm>	 /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url htt
[11:46:24] <icinga-wm>	 4.0.208:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:46:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1002.eqiad.wmnet with reason: host reimage
[11:46:58] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: designate: override to enable cloud-private for designate [puppet] - 10https://gerrit.wikimedia.org/r/954654 (https://phabricator.wikimedia.org/T345240)
[11:47:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{
[11:47:00] <icinga-wm>	 Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.165:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.0.165:7231/
[11:47:00] <icinga-wm>	 edia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:47:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedi
[11:47:22] <icinga-wm>	 /page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.97:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.97:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbas
[11:47:43] <hnowlan>	 looking
[11:47:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:48:04] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: designate: override to enable cloud-private for designate [puppet] - 10https://gerrit.wikimedia.org/r/954654 (https://phabricator.wikimedia.org/T345240)
[11:48:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Rebuild Java images to update to latest OpenJDK 11 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/954655
[11:49:41] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1002.eqiad.wmnet with reason: host reimage
[11:49:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954654 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[11:49:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip
[11:49:54] <icinga-wm>	 /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res
[11:49:54] <icinga-wm>	 s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.104:7231/en.wikipedia.org/v1/media/math/check/tex: T https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:50:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Could not fetch url http://10.64.16.38:7231/en.wikipedia.org/v1/page/title/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Generic connection error: HTTPConnectionPool(host=10.64.16.38, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/page/title/User%3ABSitzmann_%28WMF
[11:50:02] <icinga-wm>	 S%2FTest%2FFrankenstein (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd9f0695c18: Failed to establish a new connection: [Errno 111] Connection refused)): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Could not fetch url http://10.64.16.38:7231/en.wikipedia.org/v1/page/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Generic connection error: HTTPConnectionP
[11:50:02] <icinga-wm>	 =10.64.16.38, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/page/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein (Caused by NewConnectionError(urlli https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:50:03] <hnowlan>	 rolling back
[11:50:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip
[11:50:38] <icinga-wm>	 /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url htt
[11:50:38] <icinga-wm>	 4.16.117:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.117:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:50:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wik
[11:50:42] <icinga-wm>	 rg/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:51:17] <moritzm>	 !log installing grub2 updates from Bullseye point release
[11:51:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-ht
[11:51:38] <icinga-wm>	 e} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:52:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedi
[11:52:28] <icinga-wm>	 /page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:52:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip
[11:52:28] <icinga-wm>	 /v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:52:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/mat
[11:52:52] <icinga-wm>	 {type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.71:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.71:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:53:07] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@26bc1a5]: Add new wikis T343543 T343549 T345171 (duration: 14m 32s)
[11:53:12] <stashbot>	 T343549: Add suwikisource to RESTBase - https://phabricator.wikimedia.org/T343549
[11:53:12] <stashbot>	 T343543: Add blkwiktionary to RESTBase - https://phabricator.wikimedia.org/T343543
[11:53:13] <stashbot>	 T345171: Add tlywiki to RESTBase - https://phabricator.wikimedia.org/T345171
[11:53:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip
[11:53:14] <icinga-wm>	 /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res
[11:53:14] <icinga-wm>	 s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.117:7231/en.wikipedia.org/v1/media/math/check/tex: https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:53:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedi
[11:53:18] <icinga-wm>	 /page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbas
[11:54:52] <wikibugs>	 (03PS2) 10Jbond: puppetmasters: switch to HTTPSUrl [puppet] - 10https://gerrit.wikimedia.org/r/954652 (https://phabricator.wikimedia.org/T338811)
[11:54:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip
[11:54:54] <icinga-wm>	 /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res
[11:54:54] <icinga-wm>	 s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.31:7231/en.wikipedia.org/v1/media/math/check/tex: Ti https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:55:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2002.codfw.wmnet
[11:56:12] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add a Hiera option to enable ICU67 component [puppet] - 10https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561)
[11:56:14] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable icu67 component on api canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561)
[11:56:16] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable icu67 component on appserver canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954658 (https://phabricator.wikimedia.org/T345561)
[11:56:18] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable icu67 component on mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561)
[11:56:20] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable icu67 component on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/954660 (https://phabricator.wikimedia.org/T345561)
[11:56:22] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable icu67 component on dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561)
[11:56:24] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable icu67 component on cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954662 (https://phabricator.wikimedia.org/T345561)
[11:56:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:56:26] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable icu67 component on jobrunner hosts [puppet] - 10https://gerrit.wikimedia.org/r/954663 (https://phabricator.wikimedia.org/T345561)
[11:56:28] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable icu67 component on parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/954664 (https://phabricator.wikimedia.org/T345561)
[11:56:30] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable icu67 component on api hosts [puppet] - 10https://gerrit.wikimedia.org/r/954665 (https://phabricator.wikimedia.org/T345561)
[11:56:32] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable icu67 component on appserver hosts [puppet] - 10https://gerrit.wikimedia.org/r/954666 (https://phabricator.wikimedia.org/T345561)
[11:56:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:57:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:59:27] <wikibugs>	 (03CR) 10Muehlenhoff: Add a Hiera option to enable ICU67 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris)
[11:59:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2002.codfw.wmnet
[12:00:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:00:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w
[12:00:20] <icinga-wm>	 wikimedia.org/wiki/Services/Monitoring/restbase
[12:00:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:00:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Add a Hiera option to enable ICU67 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris)
[12:00:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.6
[12:00:38] <icinga-wm>	 :7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.179:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:01:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI
[12:01:02] <icinga-wm>	 est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:01:16] <wikibugs>	 (PS2) Alexandros Kosiaris: Add a Hiera option to enable ICU67 component [puppet] - https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561)
[12:01:18] <wikibugs>	 (PS2) Alexandros Kosiaris: Enable icu67 component on api canary hosts [puppet] - https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561)
[12:01:20] <wikibugs>	 (PS2) Alexandros Kosiaris: Enable icu67 component on appserver canary hosts [puppet] - https://gerrit.wikimedia.org/r/954658 (https://phabricator.wikimedia.org/T345561)
[12:01:22] <wikibugs>	 (PS2) Alexandros Kosiaris: Enable icu67 component on mwmaint hosts [puppet] - https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561)
[12:01:24] <wikibugs>	 (PS2) Alexandros Kosiaris: Enable icu67 component on deploy hosts [puppet] - https://gerrit.wikimedia.org/r/954660 (https://phabricator.wikimedia.org/T345561)
[12:01:26] <wikibugs>	 (PS2) Alexandros Kosiaris: Enable icu67 component on dumps hosts [puppet] - https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561)
[12:01:29] <wikibugs>	 (PS2) Alexandros Kosiaris: Enable icu67 component on cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/954662 (https://phabricator.wikimedia.org/T345561)
[12:01:31] <wikibugs>	 (PS2) Alexandros Kosiaris: Enable icu67 component on jobrunner hosts [puppet] - https://gerrit.wikimedia.org/r/954663 (https://phabricator.wikimedia.org/T345561)
[12:01:33] <wikibugs>	 (PS2) Alexandros Kosiaris: Enable icu67 component on parsoid hosts [puppet] - https://gerrit.wikimedia.org/r/954664 (https://phabricator.wikimedia.org/T345561)
[12:01:35] <wikibugs>	 (PS2) Alexandros Kosiaris: Enable icu67 component on api hosts [puppet] - https://gerrit.wikimedia.org/r/954665 (https://phabricator.wikimedia.org/T345561)
[12:01:37] <wikibugs>	 (PS2) Alexandros Kosiaris: Enable icu67 component on appserver hosts [puppet] - https://gerrit.wikimedia.org/r/954666 (https://phabricator.wikimedia.org/T345561)
[12:01:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:02:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:02:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:02:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox2002.codfw.wmnet
[12:02:47] <wikibugs>	 (PS3) Jbond: puppetmasters: switch to HTTPSUrl [puppet] - https://gerrit.wikimedia.org/r/954652 (https://phabricator.wikimedia.org/T338811)
[12:02:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:03:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:03:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 27381
[12:04:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:04:29] <wikibugs>	 (PS4) Jbond: puppetmasters: switch to HTTPSUrl [puppet] - https://gerrit.wikimedia.org/r/954652 (https://phabricator.wikimedia.org/T338811)
[12:04:56] <wikibugs>	 (CR) Muehlenhoff: Add a Hiera option to enable ICU67 component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) (owner: Alexandros Kosiaris)
[12:05:17] <wikibugs>	 (Abandoned) Jbond: puppetmasters: switch to HTTPSUrl [puppet] - https://gerrit.wikimedia.org/r/954652 (https://phabricator.wikimedia.org/T338811) (owner: Jbond)
[12:05:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:05:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:06:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox2002.codfw.wmnet
[12:06:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:07:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:07:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:07:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:07:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:08:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox1002.eqiad.wmnet
[12:09:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:10:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:10:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:10:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:10:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:11:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:12:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:12:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:13:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:13:22] <wikibugs>	 (PS5) Jbond: puppetmaster: update to use new puppetdb servers [puppet] - https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214)
[12:13:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:13:24] <wikibugs>	 (PS1) Jbond: puppetmaster: add parameter to change the port that puppetdb runs [puppet] - https://gerrit.wikimedia.org/r/954669 (https://phabricator.wikimedia.org/T342214)
[12:13:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:14:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:14:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:14:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:14:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:14:48] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43136/console" [puppet] - https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[12:15:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:15:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:16:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:16:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:16:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:16:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status
[12:16:46] <icinga-wm>	 pecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:17:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:17:25] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] puppetdb-api: switch dev sevices back to puppetdb-api [puppet] - https://gerrit.wikimedia.org/r/954647 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[12:17:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:17:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:18:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:18:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:18:18] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netbox1002.eqiad.wmnet
[12:18:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:18:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:18:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.125:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.125:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:18:52] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43137/console" [puppet] - https://gerrit.wikimedia.org/r/954669 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[12:19:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:19:20] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] puppetmaster: add parameter to change the port that puppetdb runs [puppet] - https://gerrit.wikimedia.org/r/954669 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[12:19:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:20:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:20:35] <wikibugs>	 (PS6) Jbond: puppetmaster: update to use new puppetdb servers [puppet] - https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214)
[12:20:46] <wikibugs>	 (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[12:21:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:22:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:22:25] <wikibugs>	 (PS1) Hnowlan: hieradata: remove restbase1030 from ratelimit list [puppet] - https://gerrit.wikimedia.org/r/954672
[12:22:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:22:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:22:56] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] hieradata: remove restbase1030 from ratelimit list [puppet] - https://gerrit.wikimedia.org/r/954672 (owner: Hnowlan)
[12:23:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:15] <wikibugs>	 (CR) Hnowlan: [C: +2] hieradata: remove restbase1030 from ratelimit list [puppet] - https://gerrit.wikimedia.org/r/954672 (owner: Hnowlan)
[12:23:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status
[12:23:22] <icinga-wm>	 pecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:23:45] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 27381
[12:23:58] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 136065
[12:24:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:24:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:24:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:24:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:24:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:24:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:24:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:25:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:25:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:25:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:25:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:25:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:26:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:26:24] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 136065
[12:26:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:26:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:26:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:26:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:26:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:26:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:27:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:27:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:27:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:27:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:27:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:28:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:28:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:28:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:28:58] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 138884
[12:29:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:29:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[12:29:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:29:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:29:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:29:55] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 138884
[12:30:06] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 149665
[12:30:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:30:21] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 149665
[12:30:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:30:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:30:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:30:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:31:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:31:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix use of more than one src/dst sets [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[12:31:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:31:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:32:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:32:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:32:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:32:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:32:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:33:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:33:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:33:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:33:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:33:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:34:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:34:32] <wikibugs>	 (03PS1) 10Filippo Giunchedi: jaeger: match production opensearch replica/shard settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954675 (https://phabricator.wikimedia.org/T344952)
[12:34:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.97:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.97:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:34:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:34:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:35:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:35:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:35:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:35:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status
[12:35:46] <icinga-wm>	 pecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:36:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:36:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:36:54] <wikibugs>	 (03PS1) 10Hnowlan: hieradata: change restbase seeds to reflect downed node [puppet] - 10https://gerrit.wikimedia.org/r/954676
[12:36:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:37:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:37:40] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: change restbase seeds to reflect downed node [puppet] - 10https://gerrit.wikimedia.org/r/954676 (owner: 10Hnowlan)
[12:37:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:37:58] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] hieradata: change restbase seeds to reflect downed node [puppet] - 10https://gerrit.wikimedia.org/r/954676 (owner: 10Hnowlan)
[12:39:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:39:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:39:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:39:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:39:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:39:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:39:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:39:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:40:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:41:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:43:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:43:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:44:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:44:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:44:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:44:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:45:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Cookbook sre.puppet.sync-netbox-hiera sets 'public' var for all IPv6 GUA to true - https://phabricator.wikimedia.org/T345473 (10jbond) @cmooney we did discuss this in the original task (329669#8744920) however there wasn't really a conclusion.  the tl;dr is * its not curren...
[12:45:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.117:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.117:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:46:31] <hnowlan>	 !log staggered restarting restbase service on A:restbase 
[12:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:47:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.31:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloadin https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:47:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.104:7231/en.wikipedia.org/v1/media/math/check/tex: T https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:47:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.179:7231/en.wikipedia.org/v1/media/math/check/tex: https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:47:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.173:7231/en.wikipedia.org/v1/media/math/check/tex: https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:47:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex: T https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:47:51] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::master: Validate SA tokens with the certs of all masters [puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826)
[12:47:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:48:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:48:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:48:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:48:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kubernetes::master: Validate SA tokens with the certs of all masters [puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[12:48:20] <XioNoX>	 yo
[12:48:34] <akosiaris>	 XioNoX: hnowlan and me are already debugging RESTBase
[12:48:40] <XioNoX>	 hnowlan: looks like it's related to your work? Everything under control?
[12:48:43] <akosiaris>	 it hasn't been feeling well for >1 h now
[12:48:44] <XioNoX>	 akosiaris: cool
[12:48:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:48:56] <akosiaris>	 it just paged, probably due to the restarts
[12:49:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:49:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.38:7231/en.wikipedia.org/v1/media/math/check/tex: T https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:49:04] <XioNoX>	 yeah that's why I'm here
[12:49:11] <hnowlan>	 XioNoX: apologies 
[12:49:16] <XioNoX>	 no pb at all
[12:49:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:49:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:49:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:49:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:49:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:49:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.125:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.125:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:49:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:49:56] <XioNoX>	 I acked the page
[12:49:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:50:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:50:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:50:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:50:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:50:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:50:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:51:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:51:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:51:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:51:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:51:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:51:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:51:44] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:51:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:51:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:51:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:52:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:52:10] <XioNoX>	 let me know if you need help
[12:52:19] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1002.eqiad.wmnet with OS bullseye
[12:52:27] <akosiaris>	 will do
[12:52:45] <XioNoX>	 I acked the above page (cache_text)
[12:53:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:53:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:53:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:53:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:55:22] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add .bash_aliases file for btullis [puppet] - 10https://gerrit.wikimedia.org/r/952475 (owner: 10Btullis)
[12:55:52] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudcontro1l005: open memcached to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954679 (https://phabricator.wikimedia.org/T345240)
[12:56:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:56:19] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954679 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[12:56:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:58:07] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: designate: override to enable cloud-private for designate [puppet] - 10https://gerrit.wikimedia.org/r/954654 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[12:58:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudcontro1l005: open memcached to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954679 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[12:59:01] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: cloudcontro1l005: open memcached to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954679 (https://phabricator.wikimedia.org/T345240)
[13:00:05] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudcontro1l005: open memcached to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954679 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[13:04:40] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop-image-suggestions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:51] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes::master: Validate SA tokens with the certs of all masters [puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826)
[13:10:18] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43138/console" [puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[13:12:01] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[13:15:18] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:17:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove visualdiff client/server from testreduce role [puppet] - 10https://gerrit.wikimedia.org/r/954682 (https://phabricator.wikimedia.org/T345220)
[13:18:38] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[13:22:38] <wikibugs>	 (03PS1) 10Cathal Mooney: Add includes for IPv6 reverse ranges for new linknets from CRs to SSW [dns] - 10https://gerrit.wikimedia.org/r/954684 (https://phabricator.wikimedia.org/T327938)
[13:23:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add includes for IPv6 reverse ranges for new linknets from CRs to SSW [dns] - 10https://gerrit.wikimedia.org/r/954684 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney)
[13:24:22] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw cr<-> ssw links. - cmooney@cumin1001"
[13:25:03] <wikibugs>	 (03PS2) 10Cathal Mooney: Add includes for IPv6 reverse ranges for new linknets from CRs to SSW [dns] - 10https://gerrit.wikimedia.org/r/954684 (https://phabricator.wikimedia.org/T327938)
[13:27:01] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add includes for IPv6 reverse ranges for new linknets from CRs to SSW [dns] - 10https://gerrit.wikimedia.org/r/954684 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney)
[13:27:22] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Internet-Archive: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10jcrespo) Completing what @MatthewVernon correctly says, there are my findings:  * File existed and was being in use at least in 2...
[13:27:38] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Internet-Archive: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10jcrespo) The file: F37653248 (note it is the original because it's sha1 is 8c6169221e33cb1857f183d46bb4d6d9177240f2 or gebtj7wmiz...
[13:30:58] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Internet-Archive, 10media-backups: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10jcrespo) My recommendation is to upload the original attached here as the latest version with a link to this t...
[13:31:03] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Internet-Archive, 10media-backups: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10jcrespo) p:05Triage→03High
[13:41:11] <wikibugs>	 (03CR) 10Btullis: Increase the kafka-jumbo maximum message size to 10 MB (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis)
[13:41:29] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[13:45:46] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw cr<-> ssw links. - cmooney@cumin1001"
[13:45:46] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:46:01] <wikibugs>	 (03PS2) 10Muehlenhoff: pmacct: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/954287
[13:46:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[13:47:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[13:47:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Cookbook sre.puppet.sync-netbox-hiera sets 'public' var for all IPv6 GUA to true - https://phabricator.wikimedia.org/T345473 (10cmooney) 05Open→03Resolved a:03cmooney >>! In T345473#9140191, @jbond wrote: > @cmooney we did discuss this in the original task (329669#874...
[13:48:06] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[13:48:29] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[13:48:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[13:49:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1001.wikimedia.org
[13:49:46] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw cr<-> ssw links. - cmooney@cumin1001"
[13:49:46] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:50:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:50:32] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw cr<-> ssw links. - cmooney@cumin1001"
[13:50:32] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:52:47] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:53:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1001.wikimedia.org
[13:54:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:54:48] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:55:25] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:56:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:58:11] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:58:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:58:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:58:58] <wikibugs>	 (03PS1) 10Btullis: Update the maximum message size in kafka for eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/954690 (https://phabricator.wikimedia.org/T344688)
[14:00:31] <wikibugs>	 (03PS1) 10Elukey: ml-services: set minReplicas to 1 for drafttopic's staging env [deployment-charts] - 10https://gerrit.wikimedia.org/r/954691
[14:04:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add jaeger collector to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi)
[14:04:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:07:16] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954287 (owner: 10Muehlenhoff)
[14:07:43] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: make it talk to cloudcontrol via cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954692 (https://phabricator.wikimedia.org/T345240)
[14:08:04] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954692 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[14:08:58] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:35] <wikibugs>	 (03PS8) 10Majavah: openstack: encapi: drop legacy ports [puppet] - 10https://gerrit.wikimedia.org/r/874814
[14:09:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:11:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: set minReplicas to 1 for drafttopic's staging env [deployment-charts] - 10https://gerrit.wikimedia.org/r/954691 (owner: 10Elukey)
[14:11:56] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] openstack: encapi: drop legacy ports [puppet] - 10https://gerrit.wikimedia.org/r/874814 (owner: 10Majavah)
[14:11:59] <wikibugs>	 (03PS11) 10Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[14:14:29] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Update backup source for s2, x1 to be MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/954693
[14:16:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: make it talk to cloudcontrol via cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954692 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[14:17:32] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Update backup source for s2, x1 to be MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/954693 (owner: 10Jcrespo)
[14:18:28] <jynus>	 mine can be merged if there is a conflict
[14:18:30] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[14:18:52] <jynus>	 there wasn't
[14:18:59] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:20:25] <wikibugs>	 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10Vgutierrez) the key needs to be uploaded to the puppet repo, you could use this CR as an example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/949839 or I could craft a new one fo...
[14:21:05] <wikibugs>	 (03PS1) 10Elukey: ml-services: set min/max replicas for Outlink in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954695
[14:23:35] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] ml-services: set min/max replicas for Outlink in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954695 (owner: 10Elukey)
[14:24:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: set min/max replicas for Outlink in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954695 (owner: 10Elukey)
[14:27:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:27:48] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[14:28:27] <wikibugs>	 (03PS1) 10Cathal Mooney: Homer YAML additions for new row A/B switches in Codfw [homer/public] - 10https://gerrit.wikimedia.org/r/954697 (https://phabricator.wikimedia.org/T327938)
[14:29:14] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: additional keystone overrides for cloud-private migration [puppet] - 10https://gerrit.wikimedia.org/r/954698 (https://phabricator.wikimedia.org/T345240)
[14:29:20] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[14:29:43] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] mesh: new networkpolicy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/954210 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:29:47] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) >>! In T345265#9134920, @kamila wrote: > Thank you @Trizek-WMF ! The message looks good. Maybe I'd...
[14:29:58] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954698 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[14:30:36] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Rebuild Java images to update to latest OpenJDK 11 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/954655 (owner: 10Muehlenhoff)
[14:31:24] <godog>	 !log bounce prometheus@k8s-aux
[14:31:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: additional keystone overrides for cloud-private migration [puppet] - 10https://gerrit.wikimedia.org/r/954698 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[14:32:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:32:35] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila) Thank you @Trizek-WMF, sounds good!  I will ping you regarding translations :-)
[14:32:49] <wikibugs>	 (03PS1) 10Majavah: cr-labs: Remove port filter on Prometheus term [homer/public] - 10https://gerrit.wikimedia.org/r/954699
[14:33:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cr-labs: Remove port filter on Prometheus term [homer/public] - 10https://gerrit.wikimedia.org/r/954699 (owner: 10Majavah)
[14:35:19] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add temporary buster-based PHP7.4 icu67 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/954700 (https://phabricator.wikimedia.org/T329491)
[14:38:29] <jinxer-wm>	 (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[14:42:40] <wikibugs>	 (03PS1) 10Ayounsi: Add MTU 9000 as valid option for NTT VPLS [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/954702 (https://phabricator.wikimedia.org/T336828)
[14:43:28] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms
[14:43:57] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Rebuild Java images to update to latest OpenJDK 11 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/954655 (owner: 10Muehlenhoff)
[14:44:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add MTU 9000 as valid option for NTT VPLS [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/954702 (https://phabricator.wikimedia.org/T336828) (owner: 10Ayounsi)
[14:44:38] <wikibugs>	 (03Merged) 10jenkins-bot: Add MTU 9000 as valid option for NTT VPLS [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/954702 (https://phabricator.wikimedia.org/T336828) (owner: 10Ayounsi)
[14:44:56] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[14:45:06] <wikibugs>	 (03PS12) 10Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[14:45:16] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[14:46:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover irc.w.o to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/954703
[14:47:03] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[14:48:07] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] pmacct: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/954287 (owner: 10Muehlenhoff)
[14:51:29] <wikibugs>	 (03CR) 10Elukey: LiftWing: add latency/availability SLO dashboards (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[14:53:30] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[14:54:07] <wikibugs>	 (03CR) 10Muehlenhoff: wikitech: Disable password resets (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah)
[14:54:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, and matches the production ferm rules (i.e. traffic from prometheus host is allowed regardless of ports)" [homer/public] - 10https://gerrit.wikimedia.org/r/954699 (owner: 10Majavah)
[14:57:15] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: cloudlb: disable older designate backends [puppet] - 10https://gerrit.wikimedia.org/r/954704 (https://phabricator.wikimedia.org/T345240)
[14:57:44] <moritzm>	 !log installing json-c security updates
[14:57:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954704 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[14:59:20] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: cloudlb: disable older designate backends [puppet] - 10https://gerrit.wikimedia.org/r/954704 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[14:59:27] <wikibugs>	 10SRE, 10serviceops-radar, 10Patch-For-Review: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10akosiaris) I 've uploaded changes for icu67 php7.4 images for use with a shellbox deployment. I 'll also create a temporary shellbox deployment based on those.
[15:03:58] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: set jaeger components services to production [puppet] - 10https://gerrit.wikimedia.org/r/954705 (https://phabricator.wikimedia.org/T344253)
[15:04:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10ayounsi) 05Open→03Resolved a:03ayounsi This is now working in prod.
[15:04:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi)
[15:07:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] mesh: new networkpolicy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/954210 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[15:07:36] <wikibugs>	 (03PS2) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027)
[15:07:43] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] cr-labs: Remove port filter on Prometheus term [homer/public] - 10https://gerrit.wikimedia.org/r/954699 (owner: 10Majavah)
[15:08:00] <wikibugs>	 (03PS3) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027)
[15:08:17] <wikibugs>	 (03Merged) 10jenkins-bot: cr-labs: Remove port filter on Prometheus term [homer/public] - 10https://gerrit.wikimedia.org/r/954699 (owner: 10Majavah)
[15:17:45] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Build production-images based on spark 3.3.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952476 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[15:21:02] <wikibugs>	 (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/954707
[15:24:36] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: allow cloudlb's haproxy connectivity [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240)
[15:25:03] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[15:26:10] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954709 (https://phabricator.wikimedia.org/T128546)
[15:26:15] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] cloudservices1006: allow cloudlb's haproxy connectivity (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[15:26:19] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF)
[15:27:10] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudservices1006: allow cloudlb's haproxy connectivity [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240)
[15:27:19] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[15:27:30] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: cloudservices1006: allow cloudlb's haproxy connectivity (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[15:30:04] <jouncebot>	 jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T1530). Please do the needful.
[15:30:46] <wikibugs>	 (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/954709 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:31:27] <wikibugs>	 (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/954709 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:33:29] <wikibugs>	 (PS7) Filippo Giunchedi: mesh: add tracing support [deployment-charts] - https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563)
[15:34:17] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudservices1006: allow cloudlb's haproxy connectivity [puppet] - https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) (owner: Arturo Borrero Gonzalez)
[15:34:33] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - https://gerrit.wikimedia.org/r/954707 (owner: Muehlenhoff)
[15:35:22] <wikibugs>	 (CR) Filippo Giunchedi: mesh: add tracing support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[15:36:43] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudservices1006: override designate servers [puppet] - https://gerrit.wikimedia.org/r/954712 (https://phabricator.wikimedia.org/T345240)
[15:37:24] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/954712 (https://phabricator.wikimedia.org/T345240) (owner: Arturo Borrero Gonzalez)
[15:40:42] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:954709| Bumping portals to master (T128546)]] (duration: 07m 01s)
[15:40:45] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:43:24] <wikibugs>	 (CR) Elukey: "Added some nits, most of the code is scaffolded so it should be fine. Do you install sextant to run the create_service.sh script right? If" [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[15:43:58] <wikibugs>	 (PS1) AikoChou: ml-services: tune autoscaling for outlink isvc [deployment-charts] - https://gerrit.wikimedia.org/r/954715 (https://phabricator.wikimedia.org/T344058)
[15:44:29] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (Trizek-WMF)
[15:46:56] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:954709| Bumping portals to master (T128546)]] (duration: 06m 14s)
[15:47:00] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:50:32] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudservices1006: override designate servers [puppet] - https://gerrit.wikimedia.org/r/954712 (https://phabricator.wikimedia.org/T345240) (owner: Arturo Borrero Gonzalez)
[15:55:07] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: tune autoscaling for outlink isvc [deployment-charts] - https://gerrit.wikimedia.org/r/954715 (https://phabricator.wikimedia.org/T344058) (owner: AikoChou)
[15:56:51] <wikibugs>	 (CR) AikoChou: [C: +2] ml-services: tune autoscaling for outlink isvc [deployment-charts] - https://gerrit.wikimedia.org/r/954715 (https://phabricator.wikimedia.org/T344058) (owner: AikoChou)
[15:57:35] <wikibugs>	 (Merged) jenkins-bot: ml-services: tune autoscaling for outlink isvc [deployment-charts] - https://gerrit.wikimedia.org/r/954715 (https://phabricator.wikimedia.org/T344058) (owner: AikoChou)
[16:05:53] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[16:06:46] <topranks>	 !log setting port 1/1/5 to speed 100G on cr1-codfw 
[16:06:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:28] <topranks>	 !log setting port 1/1/5 to speed 100G on cr2-codfw 
[16:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:48] <wikibugs>	 (PS1) DDesouza: Deploy Campaigns Event Discovery survey [mediawiki-config] - https://gerrit.wikimedia.org/r/954720 (https://phabricator.wikimedia.org/T345158)
[16:14:39] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[16:28:44] <wikibugs>	 (PS1) FNegri: [toolsdb] Enable parallel replication [puppet] - https://gerrit.wikimedia.org/r/954722 (https://phabricator.wikimedia.org/T345450)
[16:33:46] <wikibugs>	 (PS1) DDesouza: Pre-deploy Reader Demographics 2 pilot survey [mediawiki-config] - https://gerrit.wikimedia.org/r/954724 (https://phabricator.wikimedia.org/T344393)
[16:38:19] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[16:41:55] <wikibugs>	 (PS1) Majavah: hieradata: add cloudservices1006 to all designate fw rules [puppet] - https://gerrit.wikimedia.org/r/954726 (https://phabricator.wikimedia.org/T345240)
[16:44:05] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43141/console" [puppet] - https://gerrit.wikimedia.org/r/954726 (https://phabricator.wikimedia.org/T345240) (owner: Majavah)
[16:53:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:58:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T1700)
[17:00:04] <jouncebot>	 ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T1700).
[17:02:36] <wikibugs>	 SRE, MediaWiki-General, MediaWiki-libs-Stats, observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (Krinkle)
[17:04:51] <wikibugs>	 SRE, SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (ppenloglou) Could you kindly give me a hand with this @Vgutierrez whenever you have a spare moment?
[17:11:41] <wikibugs>	 (CR) Subramanya Sastry: [C: +1] Remove visualdiff client/server from testreduce role [puppet] - https://gerrit.wikimedia.org/r/954682 (https://phabricator.wikimedia.org/T345220) (owner: Muehlenhoff)
[17:51:20] <icinga-wm>	 RECOVERY - snapshot of s2 in eqiad on backupmon1001 is OK: Last snapshot for s2 at eqiad (db1225) taken on 2023-09-04 16:39:05 (1068 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[18:00:46] <zabe>	 jouncebot: nowandnext
[18:00:46] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 59 minute(s)
[18:00:46] <jouncebot>	 In 2 hour(s) and 59 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T2100)
[18:02:50] <wikibugs>	 SRE, serviceops, Datacenter-Switchover: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 (kamila)
[18:05:02] <wikibugs>	 (CR) Ladsgroup: [C: +1] Failover irc.w.o to irc1001 [dns] - https://gerrit.wikimedia.org/r/954703 (owner: Muehlenhoff)
[18:06:56] <wikibugs>	 (CR) Zabe: "I think this would cause some interwiki prefixes to be removed (like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/952984" [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: Winston Sung)
[18:07:38] <wikibugs>	 (Abandoned) Zabe: Make MediaWiki treat $lang of be_x_oldwiki as be-tarask, just don't change the real DB name [mediawiki-config] - https://gerrit.wikimedia.org/r/236966 (https://phabricator.wikimedia.org/T111853) (owner: Alex Monk)
[18:14:34] <icinga-wm>	 PROBLEM - Host mw2448 is DOWN: PING CRITICAL - Packet loss = 100%
[18:20:02] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:30:07] <wikibugs>	 (PS1) Jbond: puppet: Add new PuppetServer class [software/spicerack] - https://gerrit.wikimedia.org/r/954739
[18:34:29] <wikibugs>	 (CR) CI reject: [V: -1] puppet: Add new PuppetServer class [software/spicerack] - https://gerrit.wikimedia.org/r/954739 (owner: Jbond)
[18:53:05] <wikibugs>	 (CR) Winston Sung: SiteMatrix config: Remove deprecated language codes from the list (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: Winston Sung)
[18:59:39] <wikibugs>	 (CR) Winston Sung: SiteMatrix config: Remove deprecated language codes from the list (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: Winston Sung)
[19:08:56] <icinga-wm>	 RECOVERY - snapshot of x1 in eqiad on backupmon1001 is OK: Last snapshot for x1 at eqiad (db1225) taken on 2023-09-04 18:27:37 (370 GiB, -0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[19:09:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:14:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:59:14] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:59:24] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:59:36] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[20:00:20] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[20:00:48] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[20:05:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[20:05:54] <icinga-wm>	 PROBLEM - grafana.wikimedia.org on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[20:06:32] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 07 Feb 2027 06:17:23 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[20:06:42] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 07 Feb 2027 06:17:23 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[20:12:47] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 565 bytes in 7.237 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[20:12:47] <icinga-wm>	 RECOVERY - grafana.wikimedia.org on grafana1002 is OK: HTTP OK: HTTP/1.1 200 OK - 128346 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[20:12:47] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T2100).
[21:11:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:42:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:47:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:00:32] <wikibugs>	 SRE-swift-storage, collaboration-services: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (eoghan) We've wrapped up testing on this for the moment, and we're fairly happy that it's where we want to go in the future. We're going to hold off until a little later in the FY...
[22:23:58] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:58:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down