[00:38:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/949225
[00:38:36] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/949225 (owner: 10TrainBranchBot)
[00:42:20] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:46] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:28] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/949225 (owner: 10TrainBranchBot)
[01:51:52] (03CR) 10Ladsgroup: [C: 03+1] "I'll deploy this later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950132 (owner: 10Zabe)
[02:06:40] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:49] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:31:40] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:49:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:11:04] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:12:24] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:05:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:10:54] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (bast3007), Fresh: 128 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:31:10] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[04:58:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[05:11:16] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 129 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:31:48] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:32:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:44:30] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:45:04] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:14:50] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[06:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230819T0700)
[07:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:44:51] (03PS1) 10Aklapper: mariadb: grant user 'phstats' additional select on phabricator_repository db [puppet] - 10https://gerrit.wikimedia.org/r/950683 (https://phabricator.wikimedia.org/T344513)
[07:45:12] 10SRE, 10SRE-Access-Requests, 10DBA, 10Patch-For-Review: mariadb: grant user 'phstats' additional select on phabricator_repository DB - https://phabricator.wikimedia.org/T344513 (10Aklapper) 05Stalled→03Open
[08:05:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:06:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:19:33] (03PS2) 10Zabe: manage-dblist: Add lang to langlist if not present [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950132
[08:26:09] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 16 hosts with reason: Downtime esams hosts prior to cr1-esams reboot
[08:26:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 16 hosts with reason: Downtime esams hosts prior to cr1-esams reboot
[08:26:45] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7e93d2ff-5842-45cb-8ae6-d620952951e6) set by cmooney@cumin1001 for 2:00:00 on 16 host(s) and thei...
[08:37:15] 10SRE, 10Infrastructure-Foundations, 10netops: cr1-esams listing transport ospf interfaces multiple times - https://phabricator.wikimedia.org/T344546 (10cmooney) p:05Triage→03Medium
[08:37:22] !log downtiming esams hosts ahead of core router (cr1-esams) reboot T344546
[08:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:26] T344546: cr1-esams listing transport ospf interfaces multiple times - https://phabricator.wikimedia.org/T344546
[08:38:04] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:38:09] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 29 hosts with reason: Downtime esams hosts prior to migration week.
[08:38:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 29 hosts with reason: Downtime esams hosts prior to migration week.
[08:38:47] 10SRE, 10Infrastructure-Foundations, 10netops: cr1-esams listing transport ospf interfaces multiple times - https://phabricator.wikimedia.org/T344546 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9daa4455-f5dd-4a7a-8558-107a3e4842a2) set by cmooney@cumin1001 for 2:00:00 on 29 host(s) an...
[08:42:28] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:04] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:09:18] RECOVERY - BFD status on cr1-esams is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:09:20] RECOVERY - OSPF status on cr1-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:09:48] RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:16:35] 10SRE, 10Infrastructure-Foundations, 10netops: cr1-esams listing transport ospf interfaces multiple times - https://phabricator.wikimedia.org/T344546 (10cmooney) 05Open→03Resolved @ayounsi advised to try running a "commit full" on the router rather than a reboot. Apparently this command forces a full co...
[09:33:48] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) p:05Triage→03Low
[09:38:59] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney)
[09:40:05] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney)
[09:42:36] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) In terms of Pybal or Anycast routes etc. we can possibly not send those. If the overall approach seems good we can weigh up exactly what is useful to s...
[09:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[09:51:30] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) This should also result in more optimal routing as switches will send traffic to the CR with the best path to the remote internal destinations. i.e. if...
[10:14:51] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[11:35:46] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:42] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdn1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[12:06:45] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:38:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:42:24] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:40] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:13:28] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:13:34] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:14:57] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[14:16:40] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:11:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[16:52:02] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:53:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:53:40] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:54:58] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:00:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:00:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:48:37] topranks: the cr1-esams alert is just expired?
[17:48:50] cr1-esams page was it firing again after 24h - everything seems ok I’ll do a quick check in an hour or so when I’m at my computer
[17:49:13] jelto: yes, acked yesterday and wasn’t resolved after so firing again
[17:49:14] Great thanks
[17:49:18] Sorry for the noise
[17:49:23] np
[17:50:39] I can confirm, alert fired yesterday Aug. 18 - 7:41 PM, so 24h ago
[18:15:06] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:34:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:48:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:55:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:55:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:11:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[22:15:11] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed