[00:02:47] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8
[00:21:41] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:40:35] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:46:59] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:08:07] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:30:39] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:39:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:11] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:48:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:41] PROBLEM - puppet last run on gitlab2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:54:19] Any reason why I might not be able to view the contributions of a particular bot? https://en.wiktionary.org/wiki/Special:Contributions/WingerBot
[01:54:30] [e5451e66-2d21-4344-8477-b8675e176632] 2022-05-15 01:53:34: Fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError"
[01:55:20] well that's not good
[01:56:28] It is a prolific bot. I wonder if it has made so many edits lately that the query planner has got confused
[01:56:31] Should I file it in Phab?
[01:56:47] yeah, I doubt there's anyone around on a Saturday night that can look further
[01:57:02] Lots of edits will trip the query timeout
[01:57:41] T307295
[01:57:41] T307295: Bot contributions page in Catalan wikipedia not displayed - https://phabricator.wikimedia.org/T307295
[02:00:10] Cheers p858snake
[02:14:43] RECOVERY - puppet last run on gitlab2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:27:33] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:27:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:43:51] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:24:27] I missed tto :((
[03:38:12] (03PS1) 10Stang: zhwikisource: Add NS100 to wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791768 (https://phabricator.wikimedia.org/T308393)
[03:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:00:53] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:07] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:40:49] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:42:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1057-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:05:55] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
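
The MegaRAID alert just above means the controller on an-worker1081 has dropped all 13 logical drives from WriteBack to WriteThrough; on these controllers that usually happens while the BBU is degraded or in a relearn cycle rather than because anyone changed the configuration, which would also explain the OK/CRITICAL flapping. A minimal sketch with the MegaCli tool the linked runbook refers to (run on the affected host; flags follow the standard MegaCli syntax, and adapter/LD wildcards are used for brevity):

    # Battery state: a relearn cycle or failed BBU forces WriteThrough
    megacli -AdpBbuCmd -GetBbuStatus -aALL

    # Current cache policy of every logical drive
    megacli -LDGetProp -Cache -LAll -aALL

    # Request WriteBack again once the battery is healthy
    megacli -LDSetProp WB -LAll -aALL
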
[05:13:05] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:24:36] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1177 - https://phabricator.wikimedia.org/T308387 (10Marostegui) 05Open→03Resolved a:03Marostegui The RAID is indeed not degraded: ` root@db1177:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name...
[05:27:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1057-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:33:19] 10SRE, 10DBA, 10Wikimedia-Incident, 10Wikimedia-production-error: 2022-05-14 Databases - https://phabricator.wikimedia.org/T308380 (10Marostegui) >>! In T308380#7928926, @Ladsgroup wrote: > s8 has so many replicas that I think its weights should rebalance automatically using some metrics like connection co...
[05:43:01] 10SRE, 10DBA, 10Wikimedia-Incident, 10Wikimedia-production-error: 2022-05-14 Databases - https://phabricator.wikimedia.org/T308380 (10Marostegui) I have been investigating db1172's graphs and performance and I haven't been able to find anything that could explain this all of a sudden issue. My theory is th...
[05:47:43] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:25:09] (03CR) 10Gergő Tisza: [C: 03+1] "Looks good but after T307521 it will need something like" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791007 (https://phabricator.wikimedia.org/T305659) (owner: 10Sergio Gimeno)
[06:30:49] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Marostegui) Self note: dbproxy1021 isn't active
[06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:43:15] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:46:37] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220515T0700)
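
The recurring NodeTextfileStale alerts in this log (cloudcontrol2001-dev, elastic1075, ms-be2067, cloudvirt1019) come from node-exporter's textfile collector: local scripts or timers drop *.prom files into a directory, the exporter publishes each file's age as node_textfile_mtime_seconds, and the alert fires when a file stops being refreshed. A sketch for locating the stale file on an affected host (the metric name is standard on current node-exporter versions; the directory path below is an assumption, check the exporter's --collector.textfile.directory flag):

    # Ask the running exporter which textfiles it sees and when they were last written
    curl -s http://localhost:9100/metrics | grep '^node_textfile_mtime_seconds'

    # Compare against the files on disk (path is a guess; adjust to the exporter's flag)
    ls -l --time-style=long-iso /var/lib/prometheus/node.d/*.prom
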
[07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:22:00] (03PS9) 10Gergő Tisza: Account creation: update live campaigns config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno)
[07:22:02] (03PS6) 10Gergő Tisza: Account creation: enable thankyoupage campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791007 (https://phabricator.wikimedia.org/T305659) (owner: 10Sergio Gimeno)
[07:22:04] (03PS1) 10Gergő Tisza: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791770 (https://phabricator.wikimedia.org/T307521)
[07:23:13] (03CR) 10jerkins-bot: [V: 04-1] GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791770 (https://phabricator.wikimedia.org/T307521) (owner: 10Gergő Tisza)
[07:40:00] (03PS2) 10Gergő Tisza: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791770 (https://phabricator.wikimedia.org/T307521)
[07:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:46:49] PROBLEM - puppet last run on thanos-fe2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:50:20] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:58:17] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:45:15] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 53.65 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[10:47:31] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
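
The "puppet last run" checks seen above (gitlab2001 overnight, thanos-fe2001 here) fire when the agent has not completed a run recently, most often because it was disabled for maintenance or a run keeps failing. A sketch of the usual first checks using the stock puppet CLI (nothing WMF-specific assumed; local wrapper scripts may also exist):

    # Was the agent disabled on purpose? The lock file records the reason.
    cat "$(puppet agent --configprint agent_disabled_lockfile)" 2>/dev/null || echo "agent not disabled"

    # Where the last run summary (timestamps, failure counts) is written
    puppet agent --configprint lastrunfile

    # Trigger a run by hand once the cause is addressed
    puppet agent --test
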
[11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:49:03] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[12:50:01] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:09:31] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[13:10:29] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:01] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:16:21] looks like a spike
[13:16:49] i'm around (but in a train unfortunately)
[13:19:01] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:20:44] mmhh indeed, we can tweak the alert for a longer duration too if needed
[13:21:20] godog: is turnilo loading for you https://turnilo.wikimedia.org/#wmf_netflow ?
[13:21:54] * volans here too
[13:21:57] XioNoX: it is not
[13:21:59] anything I can do to help?
[13:22:04] looks like there was a peak in tcp aborted in FR, wonder if we had a blip in drmrs
[13:22:31] but i would say looks like a blip and not much to worry about
[13:22:42] tcp timeout from many locations too (IR, CN)
[13:22:50] yeah agreed
[13:23:26] we had a spike of traffic I'd like to have a closer look but with turnilo down...
[13:23:45] https://librenms.wikimedia.org/bill/bill_id=25/
[13:24:32] why is turnilo down? related or was already like that?
[13:24:54] volans: maybe related to https://phabricator.wikimedia.org/T301990
[13:25:01] but it's the first time I see it as down
[13:25:22] I'm having a quick look
[13:25:58] an-tool1007 is down according to icinga
[13:26:31] volans: https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:47] friday midnight UTC upgrade
[13:27:16] :/
[13:27:23] and didn't go well
[13:27:29] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[13:28:24] going back afk, but otherwise available
[13:28:33] XioNoX: volans: things look good here and i was driving so i think ill leave and get back to that unless you need me to stick around
[13:28:56] jbond: go afk, nothing urgent at this time
[13:29:40] yeah, I'm going back to my weekend too
[13:29:43] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[13:35:07] ack cool later enjoy your weekends
[13:49:47] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10JMeybohm) We should be aware that changing the topology annotations also changes scheduling behavior. As of now, the scheduler w...
[14:03:21] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:56:25] (03PS1) 10Samtar: InitialiseSettings: Enable SandboxLink for uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791787 (https://phabricator.wikimedia.org/T308399)
[14:57:13] (03CR) 10jerkins-bot: [V: 04-1] InitialiseSettings: Enable SandboxLink for uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791787 (https://phabricator.wikimedia.org/T308399) (owner: 10Samtar)
[14:58:50] (03PS2) 10Samtar: InitialiseSettings: Enable SandboxLink for uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791787 (https://phabricator.wikimedia.org/T308399)
[15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:54:41] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:59:39] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:00:19] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
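
Context for the NELHigh page at 13:14: Network Error Logging reports are generated by browsers that were told to report failures via NEL and Report-To response headers, so events like tcp.timed_out describe client-side connection trouble toward the edge, and they land in the logstash dashboard linked from the alert. A quick way to confirm the headers are actually being served (a sketch; header names are from the W3C NEL spec, and the URL is just one example endpoint):

    # If NEL is enabled at the edge, both headers should appear in any response
    curl -sI https://en.wikipedia.org/wiki/Main_Page | grep -iE '^(nel|report-to):'

Because the reports come from clients, a short burst concentrated in a few countries (FR, IR, CN here) tends to indicate a connectivity blip between those users and a particular edge site rather than a server-side problem, consistent with the reading above.
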
[16:10:35] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:31:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[16:35:09] PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[16:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:44:23] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:53:17] RECOVERY - Check systemd state on an-worker1138 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:55:53] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:14:56] (03PS1) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165)
[17:15:30] (03CR) 10jerkins-bot: [V: 04-1] Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm)
[17:30:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[17:34:13] PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:49:04] (03PS1) 10Stang: votewiki: Change wgLanguageCode to zh for May 2022 zhwiki admin election [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791797 (https://phabricator.wikimedia.org/T308397)
[17:56:37] RECOVERY - Check systemd state on an-worker1138 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
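
The an-worker1138 alerts above come from a check of the overall systemd state: a single failed unit (here systemd-timedated.service) marks the whole host as degraded. A sketch of typical triage with stock systemd tooling (nothing WMF-specific assumed):

    # Anything other than "running" makes the check go CRITICAL
    systemctl is-system-running

    # Which units failed, and why
    systemctl --failed
    journalctl -u systemd-timedated.service -n 50 --no-pager

    # After fixing (or deciding to ignore) the cause, clear the failed state
    systemctl reset-failed systemd-timedated.service
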
[18:02:03] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:49:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:54:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:43:17] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:52:07] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:00:57] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:05:33] (03CR) 10Stang: bnwikivoyage: Set $wgRelatedArticlesUseCirrusSearch to true on bnwikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791734 (https://phabricator.wikimedia.org/T307904) (owner: 10Yahya)
[21:13:44] (03PS4) 10Yahya: bnwikivoyage: Set $wgRelatedArticlesUseCirrusSearch to true on bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791734 (https://phabricator.wikimedia.org/T307904)
[21:14:29] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@378e7ca]: (no justification provided)
[21:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:38] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@378e7ca]: (no justification provided) (duration: 00m 08s)
[21:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:28] (03CR) 104nn1l2: [C: 03+1] InitialiseSettings: Enable SandboxLink for uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791787 (https://phabricator.wikimedia.org/T308399) (owner: 10Samtar)
[21:30:06] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@378e7ca]: (no justification provided)
[21:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:15] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@378e7ca]: (no justification provided) (duration: 00m 08s)
[21:30:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[21:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:26] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@378e7ca]: (no justification provided)
[21:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:34] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@378e7ca]: (no justification provided) (duration: 00m 08s)
[21:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:17] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@378e7ca]: (no justification provided)
[21:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:24] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@378e7ca]: (no justification provided) (duration: 00m 07s)
[21:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:56] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@378e7ca]: (no justification provided)
[21:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:03] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@378e7ca]: (no justification provided) (duration: 00m 07s)
[21:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:41] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:02:13] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:03:41] (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791734 (https://phabricator.wikimedia.org/T307904) (owner: 10Yahya)
[22:05:27] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:42:25] (03CR) 10Stang: bnwikivoyage: Set $wgRelatedArticlesUseCirrusSearch to true on bnwikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791734 (https://phabricator.wikimedia.org/T307904) (owner: 10Yahya)
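
The repeated !log entries above come from scap-based deploys of the airflow-dags/analytics_test repository run from deploy1002; "(no justification provided)" simply means no log message was passed to scap. A sketch of the invocation (the checkout path under /srv/deployment is an assumption, and the task number is a placeholder):

    # On the deployment host, from the repository's deploy checkout
    cd /srv/deployment/airflow-dags/analytics_test
    # The message becomes the SAL justification instead of "(no justification provided)"
    scap deploy 'T123456: redeploy analytics_test DAGs'   # T123456 is a hypothetical task
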
[22:47:07] (03CR) 10Dave Pifke: [C: 03+1] Remove webperf1001/2001 from Scap config [puppet] - 10https://gerrit.wikimedia.org/r/791300 (owner: 10Muehlenhoff)
[23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:46:55] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
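
The SSH PROBLEM/RECOVERY pairs for *.mgmt hosts throughout this log (restbase2012 just above, plus wtp, pki, druid, aqs and labweb earlier) are checks against out-of-band management controllers, which are often slow to answer SSH and flap in exactly this way. A sketch for reproducing the check by hand (the <host>.mgmt.<site>.wmnet naming is an assumption inferred from the check names, and check_ssh requires the monitoring-plugins package):

    # Same 10-second budget as the Icinga check: is port 22 answering at all?
    timeout 10 nc -vz restbase2012.mgmt.codfw.wmnet 22

    # Or run the actual plugin, if monitoring-plugins is installed locally
    /usr/lib/nagios/plugins/check_ssh -t 10 restbase2012.mgmt.codfw.wmnet
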