[00:56:22] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:40:42] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:41:14] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2623.13 ms
[01:42:10] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:43:06] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[01:44:12] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 32, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:45:06] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[01:46:50] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 239.59 ms
[01:47:22] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 315.47 ms
[02:06:39] (PS1) Huji: Specify the default language of beta cluster votewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242)
[02:17:13] (CR) Reedy: Specify the default language of beta cluster votewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242) (owner: Huji)
[04:17:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:54] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:21:40] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 794 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[06:30:21] the Mailman bounce queue is going to be in a bad state for a bit, someone spammed the wikipedia-l list, which has a bunch of dead emails in the subscriber list
[06:33:30] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 50.68 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:37:40] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:44:38] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211107T0700)
[08:25:04] (PS1) Amire80: Add https://ferdinando.me to the Italian planet [puppet] - https://gerrit.wikimedia.org/r/737185
[08:40:19] (PS1) Amire80: A more focused feed for lu.is for the Wikimedia Planet [puppet] - https://gerrit.wikimedia.org/r/737186
[08:43:57] (PS7) Amire80: Update autonyms in wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870)
[08:55:47] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737187 (owner: Awight)
[08:56:34] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737188 (owner: Awight)
[08:58:20] (CR) Awight: "I believe the spelling should be "cacheable", but better to stay consistent with the function call. This temporary variable doesn't incre" [mediawiki-config] - https://gerrit.wikimedia.org/r/737189 (owner: Awight)
[09:02:02] (CR) Thiemo Kreuz (WMDE): [C: +1] "Looks correct to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870) (owner: Amire80)
[09:55:43] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737192 (owner: Awight)
[09:56:41] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737193 (owner: Awight)
[10:05:02] SRE, Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (LClightcat) @Legoktm Well....to avoid you not noticing, I ping you.(I'm sorry to disturb you) I would like to know whether the reasons I submitted will be accepted by SRE? Or w...
[10:25:10] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:49:56] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms
[11:15:12] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:37:04] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:49:18] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.40 ms
[12:23:55] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737195 (owner: Awight)
[13:19:53] (PS1) Majavah: puppetmaster: delete labs-root-password [puppet] - https://gerrit.wikimedia.org/r/737199
[13:51:33] (PS1) Majavah: P::kubernetes: allow disabling kafka ipv6 on hiera [puppet] - https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986)
[13:51:45] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986) (owner: Majavah)
[13:52:53] (PS2) Majavah: P::kubernetes: allow disabling kafka ipv6 on hiera [puppet] - https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986)
[13:53:02] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986) (owner: Majavah)
[13:53:50] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 57.96 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[13:55:56] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 88.47 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:00:26] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:58] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[14:04:39] (PS2) JMeybohm: Add cfssl-issuer and cfssl-issuer-crds chart [deployment-charts] - https://gerrit.wikimedia.org/r/737169 (https://phabricator.wikimedia.org/T294560)
[14:11:36] (PS2) Huji: Specify the default language of beta cluster votewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242)
[14:11:40] (CR) Huji: Specify the default language of beta cluster votewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242) (owner: Huji)
[14:21:09] (Abandoned) Majavah: P::kubernetes::deployment_server: Do not use ipv6 on beta [puppet] - https://gerrit.wikimedia.org/r/691494 (https://phabricator.wikimedia.org/T281986) (owner: Majavah)
[14:21:21] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986) (owner: Majavah)
[14:24:18] SRE, Infrastructure-Foundations, puppet-compiler: compiler1003.puppet-diffs.eqiad1.wikimedia.cloud out of disk space - https://phabricator.wikimedia.org/T295253 (Majavah) p: Triage→High
[14:30:36] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:41] (PS1) JMeybohm: Update copyright [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/737203
[14:36:44] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms
[14:42:49] Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations: Add memcached to mwmaint01 using puppet - https://phabricator.wikimedia.org/T240263 (Majavah) Open→Resolved mwmaint02 was created recently and didn't need this procedure.
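[editor's note] The "Router interfaces" checks above (01:42:10 and 01:44:12, and again at 19:10:14 below) poll IF-MIB::ifOperStatus (OID 1.3.6.1.2.1.2.2.1.8) over SNMP v2c and count interfaces per operational state; an SNMP timeout produces the "No response from remote host ... with snmp version 2" CRITICAL. The following is a minimal sketch of that kind of poll, assuming net-snmp's snmpwalk is installed and using a placeholder community string; the production Icinga plugin is a separate tool and may differ in detail.

#!/usr/bin/env python3
"""Rough sketch of an ifOperStatus poll, in the spirit of the 'Router interfaces' check.

Assumptions: net-snmp's snmpwalk binary is available and the community
string is a placeholder; the real Icinga plugin used in production differs.
"""
import subprocess
from collections import Counter

IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"  # IF-MIB::ifOperStatus
# ifOperStatus integer values defined by IF-MIB
STATES = {1: "up", 2: "down", 3: "testing", 4: "unknown",
          5: "dormant", 6: "notPresent", 7: "lowerLayerDown"}

def poll_interfaces(host: str, community: str = "public") -> Counter:
    """Walk ifOperStatus on `host` and count interfaces per state."""
    # A timeout or error here corresponds to the CRITICAL "No response
    # from remote host" messages in the log.
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", community, "-Oqv", host, IF_OPER_STATUS],
        capture_output=True, text=True, timeout=30, check=True,
    ).stdout
    counts = Counter()
    for line in out.splitlines():
        token = line.strip()
        if not token:
            continue
        # -Oqv prints only the value: "1" without MIBs, "up" with them
        if token.isdigit():
            state = STATES.get(int(token), "unknown")
        else:
            state = token.split("(")[0]
        counts[state] += 1
    return counts

if __name__ == "__main__":
    counts = poll_interfaces("103.102.166.128")
    print(f"interfaces up: {counts['up']}, down: {counts['down']}, "
          f"dormant: {counts['dormant']}")

Run against 103.102.166.128 with a valid community string, a poll like this would yield counts comparable to the "interfaces up: 32, down: 0, dormant: 0, excluded: 1" recovery line, minus the plugin's own exclusion logic.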
[17:27:25] SRE, MediaWiki-Maintenance-system, cloud-services-team (Kanban): processEchoEmailBatch.php failing for labtestwiki - https://phabricator.wikimedia.org/T236145 (Majavah) Open→Resolved a: Andrew
[17:28:26] SRE, wikitech.wikimedia.org, cloud-services-team (Kanban): processEchoEmailBatch.php failing for labtestwiki - https://phabricator.wikimedia.org/T236145 (Reedy)
[17:28:42] SRE, Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (Majavah)
[17:29:05] SRE, wikitech.wikimedia.org, cloud-services-team (Kanban): Wikitech and wikitech-static out of sync - https://phabricator.wikimedia.org/T292342 (Reedy) Open→Resolved
[17:29:17] SRE, Infrastructure-Foundations, Traffic, Patch-For-Review: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (Reedy)
[17:33:33] SRE, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (Majavah)
[19:00:40] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:10:12] PROBLEM - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:10:14] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:10:34] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2403:b100:3001:9::2)
[19:12:30] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[19:14:24] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 32, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:14:30] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[19:16:22] RECOVERY - Host cp5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 392.52 ms
[19:16:44] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 411.59 ms
[19:31:10] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:18] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms
[20:06:04] PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:06:26] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242) (owner: Huji)
[20:14:44] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:17:00] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:20:54] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.56 ms
[20:29:26] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:35:34] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.65 ms
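[editor's note] The "Check systemd state" alerts in this log (04:17:04 and 04:25:30 on an-launcher1002, 19:00:40 on cumin1001, 20:06:04 on ms-be2059) report systemd's overall state plus the list of failed units. Below is a minimal sketch of the same idea, assuming plain systemctl output and Nagios-style exit codes; the actual check_systemd_state plugin linked from the alerts may be implemented differently.

#!/usr/bin/env python3
"""Sketch of a systemd health probe in the spirit of 'Check systemd state'.

Assumption: this only mirrors what the alerts report (overall state plus
failed units); the real check_systemd_state plugin may differ.
"""
import subprocess
import sys

def systemd_state() -> tuple[str, list[str]]:
    """Return (overall system state, list of failed unit names)."""
    # "running", "degraded", "starting", ...; exits non-zero when not running
    state = subprocess.run(
        ["systemctl", "is-system-running"],
        capture_output=True, text=True,
    ).stdout.strip()
    failed = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--no-legend", "--plain"],
        capture_output=True, text=True,
    ).stdout
    units = [line.split()[0] for line in failed.splitlines() if line.strip()]
    return state, units

if __name__ == "__main__":
    state, units = systemd_state()
    if state == "running":
        print("OK - running: The system is fully operational")
        sys.exit(0)  # Nagios OK
    print(f"CRITICAL - {state}: The following units failed: {', '.join(units)}")
    sys.exit(2)  # Nagios CRITICAL

The "degraded" CRITICAL clears on its own once the failed unit (for example swift-drive-audit.service above) is restarted or reset, which is why the recovery at 21:02:44 follows without any visible intervention in the log.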
[20:36:35] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737209 (owner: Awight)
[20:47:38] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737210 (owner: Awight)
[20:54:25] (PS2) Awight: Extract reused dblists code into function [mediawiki-config] - https://gerrit.wikimedia.org/r/737210
[21:02:44] RECOVERY - Check systemd state on ms-be2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:33:12] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[21:45:36] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[21:49:56] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737212 (owner: Awight)
[22:28:54] (PS3) Juan90264: Add enwikibooks in wgImportSources to bnwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/737081 (https://phabricator.wikimedia.org/T295051)
[22:31:54] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:23:24] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1206.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:24:48] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:37:26] RECOVERY - snapshot of s4 in eqiad on alert1001 is OK: Last snapshot for s4 at eqiad (db1150.eqiad.wmnet:3314) taken on 2021-11-07 21:25:42 (1559 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:37:26] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:45:52] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:52:12] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
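[editor's note] The MariaDB replica alerts at the end of the log (23:23:24 onward on db2141) are driven by the replication thread state and lag that MariaDB itself reports. A minimal sketch of reading the same fields via SHOW SLAVE STATUS is below, assuming the pymysql driver and placeholder host/credentials/threshold; WMF's production checks (and their fractional lag measurement) use their own tooling.

#!/usr/bin/env python3
"""Sketch: read replica IO/SQL state and lag, like the db2141 alerts report.

Assumptions: pymysql is installed; host, user, password and the lag
threshold are placeholders, not the production configuration.
"""
import pymysql

LAG_CRITICAL_SECONDS = 600  # placeholder threshold

def replica_status(host: str, user: str, password: str) -> dict:
    """Fetch the SHOW SLAVE STATUS row as a dict (empty if not a replica)."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            return cur.fetchone() or {}
    finally:
        conn.close()

def evaluate(status: dict) -> str:
    """Map replication state to log-style OK/CRITICAL messages."""
    io_running = status.get("Slave_IO_Running")
    sql_running = status.get("Slave_SQL_Running")
    lag = status.get("Seconds_Behind_Master")
    if io_running != "Yes":
        # e.g. Errno 2026: SSL connection error while reconnecting to the master
        return (f"CRITICAL slave_io_state Slave_IO_Running: {io_running}, "
                f"Errno: {status.get('Last_IO_Errno')}, "
                f"Errmsg: {status.get('Last_IO_Error')}")
    if sql_running != "Yes":
        return f"CRITICAL slave_sql_state Slave_SQL_Running: {sql_running}"
    if lag is not None and lag > LAG_CRITICAL_SECONDS:
        return f"CRITICAL slave_sql_lag Replication lag: {lag} seconds"
    return "OK slave_io_state Slave_IO_Running: Yes"

if __name__ == "__main__":
    print(evaluate(replica_status("db2141.codfw.wmnet", "repl_check", "secret")))

In the log, the IO thread alert flaps (23:24:48 CRITICAL, 23:37:26 OK, 23:45:52 CRITICAL, 23:52:12 OK) because the replica keeps retrying the connection to db2129 at the configured 60-second retry interval; the check simply reflects whichever state it samples.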