[00:56:22] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:40:42] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:41:14] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2623.13 ms
[01:42:10] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:43:06] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[01:44:12] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 32, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:45:06] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[01:46:50] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 239.59 ms
[01:47:22] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 315.47 ms
[02:06:39] (PS1) Huji: Specify the default language of beta cluster votewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242)
[02:17:13] (CR) Reedy: Specify the default language of beta cluster votewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242) (owner: Huji)
[04:17:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:54] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:21:40] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 794 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[06:30:21] the Mailman bounce queue is going to be in a bad state for a bit, someone spammed the wikipedia-l list, which has a bunch of dead emails in the subscriber list
[06:33:30] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 50.68 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:37:40] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:44:38] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211107T0700)
[08:25:04] (PS1) Amire80: Add https://ferdinando.me to the Italian planet [puppet] - https://gerrit.wikimedia.org/r/737185
[08:40:19] (PS1) Amire80: A more focused feed for lu.is for the Wikimedia Planet [puppet] - https://gerrit.wikimedia.org/r/737186
[08:43:57] (PS7) Amire80: Update autonyms in wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870)
[08:55:47] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737187 (owner: Awight)
[08:56:34] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737188 (owner: Awight)
[08:58:20] (CR) Awight: "I believe the spelling should be "cacheable", but better to stay consistent with the function call. This temporary variable doesn't incre" [mediawiki-config] - https://gerrit.wikimedia.org/r/737189 (owner: Awight)
[09:02:02] (CR) Thiemo Kreuz (WMDE): [C: +1] "Looks correct to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870) (owner: Amire80)
[09:55:43] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737192 (owner: Awight)
[09:56:41] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737193 (owner: Awight)
[10:05:02] SRE, Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (LClightcat) @Legoktm Well....to avoid you not noticing, I ping you.(I'm sorry to disturb you) I would like to know whether the reasons I submitted will be accepted by SRE? Or w...
[10:25:10] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:49:56] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms
[11:15:12] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:37:04] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:49:18] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.40 ms
[12:23:55] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737195 (owner: Awight)
[13:19:53] (PS1) Majavah: puppetmaster: delete labs-root-password [puppet] - https://gerrit.wikimedia.org/r/737199
[13:51:33] (PS1) Majavah: P::kubernetes: allow disabling kafka ipv6 on hiera [puppet] - https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986)
[13:51:45] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986) (owner: Majavah)
[13:52:53] (PS2) Majavah: P::kubernetes: allow disabling kafka ipv6 on hiera [puppet] - https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986)
[13:53:02] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986) (owner: Majavah)
[13:53:50] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 57.96 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[13:55:56] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 88.47 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:00:26] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:58] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[14:04:39] (PS2) JMeybohm: Add cfssl-issuer and cfssl-issuer-crds chart [deployment-charts] - https://gerrit.wikimedia.org/r/737169 (https://phabricator.wikimedia.org/T294560)
[14:11:36] (PS2) Huji: Specify the default language of beta cluster votewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242)
[14:11:40] (CR) Huji: Specify the default language of beta cluster votewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242) (owner: Huji)
[14:21:09] (Abandoned) Majavah: P::kubernetes::deployment_server: Do not use ipv6 on beta [puppet] - https://gerrit.wikimedia.org/r/691494 (https://phabricator.wikimedia.org/T281986) (owner: Majavah)
[14:21:21] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986) (owner: Majavah)
[14:24:18] SRE, Infrastructure-Foundations, puppet-compiler: compiler1003.puppet-diffs.eqiad1.wikimedia.cloud out of disk space - https://phabricator.wikimedia.org/T295253 (Majavah) p: Triage→High
[14:30:36] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:41] (PS1) JMeybohm: Update copyright [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/737203
[14:36:44] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms
[14:42:49] Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations: Add memcached to mwmaint01 using puppet - https://phabricator.wikimedia.org/T240263 (Majavah) Open→Resolved mwmaint02 was created recently and didn't need this procedure.
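[editor's note] The "Router interfaces" checks above (01:42:10 and 01:44:12, and again at 19:10:14 below) poll IF-MIB::ifOperStatus (OID 1.3.6.1.2.1.2.2.1.8) over SNMP v2c and count interfaces per operational state; an SNMP timeout produces the "No response from remote host ... with snmp version 2" CRITICAL. The following is a minimal sketch of that kind of poll, assuming net-snmp's snmpwalk is installed and using a placeholder community string; the production Icinga plugin is a separate tool and may differ in detail.

#!/usr/bin/env python3
"""Rough sketch of an ifOperStatus poll, in the spirit of the 'Router interfaces' check.

Assumptions: net-snmp's snmpwalk binary is available and the community
string is a placeholder; the real Icinga plugin used in production differs.
"""
import subprocess
from collections import Counter

IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"  # IF-MIB::ifOperStatus
# ifOperStatus integer values defined by IF-MIB
STATES = {1: "up", 2: "down", 3: "testing", 4: "unknown",
          5: "dormant", 6: "notPresent", 7: "lowerLayerDown"}

def poll_interfaces(host: str, community: str = "public") -> Counter:
    """Walk ifOperStatus on `host` and count interfaces per state."""
    # A timeout or error here corresponds to the CRITICAL "No response
    # from remote host" messages in the log.
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", community, "-Oqv", host, IF_OPER_STATUS],
        capture_output=True, text=True, timeout=30, check=True,
    ).stdout
    counts = Counter()
    for line in out.splitlines():
        token = line.strip()
        if not token:
            continue
        # -Oqv prints only the value: "1" without MIBs, "up" with them
        if token.isdigit():
            state = STATES.get(int(token), "unknown")
        else:
            state = token.split("(")[0]
        counts[state] += 1
    return counts

if __name__ == "__main__":
    counts = poll_interfaces("103.102.166.128")
    print(f"interfaces up: {counts['up']}, down: {counts['down']}, "
          f"dormant: {counts['dormant']}")

Run against 103.102.166.128 with a valid community string, a poll like this would yield counts comparable to the "interfaces up: 32, down: 0, dormant: 0, excluded: 1" recovery line, minus the plugin's own exclusion logic.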
[17:27:25] SRE, MediaWiki-Maintenance-system, cloud-services-team (Kanban): processEchoEmailBatch.php failing for labtestwiki - https://phabricator.wikimedia.org/T236145 (Majavah) Open→Resolved a: Andrew
[17:28:26] SRE, wikitech.wikimedia.org, cloud-services-team (Kanban): processEchoEmailBatch.php failing for labtestwiki - https://phabricator.wikimedia.org/T236145 (Reedy)
[17:28:42] SRE, Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (Majavah)
[17:29:05] SRE, wikitech.wikimedia.org, cloud-services-team (Kanban): Wikitech and wikitech-static out of sync - https://phabricator.wikimedia.org/T292342 (Reedy) Open→Resolved
[17:29:17] SRE, Infrastructure-Foundations, Traffic, Patch-For-Review: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (Reedy)
[17:33:33] SRE, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (Majavah)
[19:00:40] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:10:12] PROBLEM - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:10:14] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:10:34] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2403:b100:3001:9::2)
[19:12:30] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[19:14:24] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 32, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:14:30] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[19:16:22] RECOVERY - Host cp5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 392.52 ms
[19:16:44] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 411.59 ms
[19:31:10] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:18] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms
[20:06:04] PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:06:26] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242) (owner: Huji)
[20:14:44] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:17:00] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:20:54] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.56 ms
[20:29:26] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:35:34] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.65 ms
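[editor's note] The "Check systemd state" alerts in this log (04:17:04 and 04:25:30 on an-launcher1002, 19:00:40 on cumin1001, 20:06:04 on ms-be2059) report systemd's overall state plus the list of failed units. Below is a minimal sketch of the same idea, assuming plain systemctl output and Nagios-style exit codes; the actual check_systemd_state plugin linked from the alerts may be implemented differently.

#!/usr/bin/env python3
"""Sketch of a systemd health probe in the spirit of 'Check systemd state'.

Assumption: this only mirrors what the alerts report (overall state plus
failed units); the real check_systemd_state plugin may differ.
"""
import subprocess
import sys

def systemd_state() -> tuple[str, list[str]]:
    """Return (overall system state, list of failed unit names)."""
    # "running", "degraded", "starting", ...; exits non-zero when not running
    state = subprocess.run(
        ["systemctl", "is-system-running"],
        capture_output=True, text=True,
    ).stdout.strip()
    failed = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--no-legend", "--plain"],
        capture_output=True, text=True,
    ).stdout
    units = [line.split()[0] for line in failed.splitlines() if line.strip()]
    return state, units

if __name__ == "__main__":
    state, units = systemd_state()
    if state == "running":
        print("OK - running: The system is fully operational")
        sys.exit(0)  # Nagios OK
    print(f"CRITICAL - {state}: The following units failed: {', '.join(units)}")
    sys.exit(2)  # Nagios CRITICAL

The "degraded" CRITICAL clears on its own once the failed unit (for example swift-drive-audit.service above) is restarted or reset, which is why the recovery at 21:02:44 follows without any visible intervention in the log.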
[20:36:35] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737209 (owner: Awight)
[20:47:38] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737210 (owner: Awight)
[20:54:25] (PS2) Awight: Extract reused dblists code into function [mediawiki-config] - https://gerrit.wikimedia.org/r/737210
[21:02:44] RECOVERY - Check systemd state on ms-be2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:33:12] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[21:45:36] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[21:49:56] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737212 (owner: Awight)
[22:28:54] (PS3) Juan90264: Add enwikibooks in wgImportSources to bnwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/737081 (https://phabricator.wikimedia.org/T295051)
[22:31:54] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:23:24] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1206.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:24:48] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:37:26] RECOVERY - snapshot of s4 in eqiad on alert1001 is OK: Last snapshot for s4 at eqiad (db1150.eqiad.wmnet:3314) taken on 2021-11-07 21:25:42 (1559 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:37:26] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:45:52] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:52:12] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
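[editor's note] The MariaDB replica alerts at the end of the log (23:23:24 onward on db2141) are driven by the replication thread state and lag that MariaDB itself reports. A minimal sketch of reading the same fields via SHOW SLAVE STATUS is below, assuming the pymysql driver and placeholder host/credentials/threshold; WMF's production checks (and their fractional lag measurement) use their own tooling.

#!/usr/bin/env python3
"""Sketch: read replica IO/SQL state and lag, like the db2141 alerts report.

Assumptions: pymysql is installed; host, user, password and the lag
threshold are placeholders, not the production configuration.
"""
import pymysql

LAG_CRITICAL_SECONDS = 600  # placeholder threshold

def replica_status(host: str, user: str, password: str) -> dict:
    """Fetch the SHOW SLAVE STATUS row as a dict (empty if not a replica)."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            return cur.fetchone() or {}
    finally:
        conn.close()

def evaluate(status: dict) -> str:
    """Map replication state to log-style OK/CRITICAL messages."""
    io_running = status.get("Slave_IO_Running")
    sql_running = status.get("Slave_SQL_Running")
    lag = status.get("Seconds_Behind_Master")
    if io_running != "Yes":
        # e.g. Errno 2026: SSL connection error while reconnecting to the master
        return (f"CRITICAL slave_io_state Slave_IO_Running: {io_running}, "
                f"Errno: {status.get('Last_IO_Errno')}, "
                f"Errmsg: {status.get('Last_IO_Error')}")
    if sql_running != "Yes":
        return f"CRITICAL slave_sql_state Slave_SQL_Running: {sql_running}"
    if lag is not None and lag > LAG_CRITICAL_SECONDS:
        return f"CRITICAL slave_sql_lag Replication lag: {lag} seconds"
    return "OK slave_io_state Slave_IO_Running: Yes"

if __name__ == "__main__":
    print(evaluate(replica_status("db2141.codfw.wmnet", "repl_check", "secret")))

In the log, the IO thread alert flaps (23:24:48 CRITICAL, 23:37:26 OK, 23:45:52 CRITICAL, 23:52:12 OK) because the replica keeps retrying the connection to db2129 at the configured 60-second retry interval; the check simply reflects whichever state it samples.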