[00:00:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T352010)', diff saved to https://phabricator.wikimedia.org/P62289 and previous config saved to /var/cache/conftool/dbconfig/20240512-000040-ladsgroup.json [00:00:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [00:00:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [00:00:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [00:01:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T352010)', diff saved to https://phabricator.wikimedia.org/P62290 and previous config saved to /var/cache/conftool/dbconfig/20240512-000104-ladsgroup.json [00:02:02] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1030549 (owner: TrainBranchBot) [00:19:30] FIRING: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:21:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:22:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:23:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:23:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923
bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:24:30] RESOLVED: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:08:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:12:26] (CR) Bartosz Dziewoński: Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1030532 (https://phabricator.wikimedia.org/T357221) (owner: Bartosz Dziewoński) [02:14:44] (PS3) Bartosz Dziewoński: Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" [mediawiki-config] - https://gerrit.wikimedia.org/r/1030532 (https://phabricator.wikimedia.org/T357221) [02:14:57] (PS2) Bartosz Dziewoński: Use ConditionalUserOptions for "discussiontools-autotopicsub" [mediawiki-config] - https://gerrit.wikimedia.org/r/1030535 (https://phabricator.wikimedia.org/T357221) [02:15:22] (PS3) Bartosz Dziewoński: Use ConditionalUserOptions for "discussiontools-autotopicsub" [mediawiki-config] - https://gerrit.wikimedia.org/r/1030535 (https://phabricator.wikimedia.org/T357221) [02:36:29] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:35] PROBLEM - snapshot of s4 in eqiad on backupmon1001 is CRITICAL: snapshot for s4
at eqiad (db1150) taken more than 3 days ago: Most recent backup 2024-05-09 02:34:28 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:48:12] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:14] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:48:47] PROBLEM - snapshot of s8 in eqiad on backupmon1001 is CRITICAL: snapshot for s8 at eqiad (db1171) taken more than 3 days ago: Most recent backup 2024-05-09 03:25:38 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:57:37] PROBLEM - snapshot of s8 in codfw on backupmon1001 is CRITICAL: snapshot for s8 at codfw (db2098) taken more than 3 days ago: Most recent backup 2024-05-09 03:26:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:27:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:29:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - 
https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:13:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:40:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T352010)', diff saved to https://phabricator.wikimedia.org/P62291 and previous config saved to /var/cache/conftool/dbconfig/20240512-064011-ladsgroup.json [06:40:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:48:12] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:55:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 968.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:55:20] !log ladsgroup@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P62292 and previous config saved to /var/cache/conftool/dbconfig/20240512-065519-ladsgroup.json [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240512T0700) [07:00:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 942.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:10:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P62293 and previous config saved to /var/cache/conftool/dbconfig/20240512-071026-ladsgroup.json [07:12:37] PROBLEM - snapshot of s5 in codfw on backupmon1001 is CRITICAL: snapshot for s5 at codfw (db2201) taken more than 3 days ago: Most recent backup 2024-05-09 06:43:04 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [07:18:59] PROBLEM - snapshot of s7 in eqiad on backupmon1001 is CRITICAL: snapshot for s7 at eqiad (db1171) taken more than 3 days ago: Most recent backup 2024-05-09 07:10:58 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [07:25:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T352010)', diff saved to https://phabricator.wikimedia.org/P62294 and previous config saved to /var/cache/conftool/dbconfig/20240512-072534-ladsgroup.json [07:25:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [07:25:40] T352010: Gradually drop old pagelinks columns - 
https://phabricator.wikimedia.org/T352010 [07:25:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [07:25:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T352010)', diff saved to https://phabricator.wikimedia.org/P62295 and previous config saved to /var/cache/conftool/dbconfig/20240512-072559-ladsgroup.json [07:54:33] PROBLEM - snapshot of s7 in codfw on backupmon1001 is CRITICAL: snapshot for s7 at codfw (db2098) taken more than 3 days ago: Most recent backup 2024-05-09 07:34:33 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [08:39:47] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:39:57] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:45:59] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:15:57] FIRING: [13x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:16:03] hi [09:16:43] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [09:16:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - 
https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:16:45] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes1010.eqiad.wmnet, parse1013.eqiad.wmnet, kubernetes1041.eqiad.wmnet, mw1367.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1433.eqiad.wmnet, mw1479.eqiad.wmnet, mw1462.eqiad.wmnet, mw1430.eqiad.wmnet, mw1388.eqiad.wmnet, mw1399.eqiad.wmnet, mw1435.eqiad.wmnet, mw1393.eqiad.wmnet, mw1454.eqiad.wmnet, parse1010.eq [09:16:45] t, kubernetes1017.eqiad.wmnet, mw1425.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1369.eqiad.wmnet, kubernetes1059.eqiad.wmnet, kubernetes1005.eqiad.wmnet, mw1486.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1483.eqiad.wmnet, mw1458.eqiad.wmnet, mw1371.eqiad.wmnet, parse1012.eqiad.wmnet, mw1453.eqiad.wmnet, mw1431.eqiad.wmnet, kubernetes1028.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kube [09:16:45] 19.eqiad.wmnet, kubernetes1031.eqiad.wmnet, mw1381.eqiad.wmnet, kubernetes1042.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, parse1006.eqiad.wmnet, parse1003.eqiad.wmnet, kuberne https://wikitech.wikimedia.org/wiki/PyBal [09:16:53] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers parse1013.eqiad.wmnet, mw1492.eqiad.wmnet, kubernetes1041.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1386.eqiad.wmnet, mw1433.eqiad.wmnet, mw1479.eqiad.wmnet, kubernetes1023.eqiad.wmnet, mw1415.eqiad.wmnet, kubernetes1038.eqiad.wmnet, mw1435.eqiad.wmnet, mw1424.eqiad.wmnet, mw1488.eqiad.wmnet, parse1010.eqiad.wmnet, p [09:16:53] .eqiad.wmnet, mw1370.eqiad.wmnet, mw1395.eqiad.wmnet, mw1465.eqiad.wmnet, kubernetes1033.eqiad.wmnet, mw1466.eqiad.wmnet, mw1483.eqiad.wmnet, mw1369.eqiad.wmnet, 
kubernetes1059.eqiad.wmnet, mw1469.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1360.eqiad.wmnet, mw1356.eqiad.wmnet, mw1458.eqiad.wmnet, kubernetes1048.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1031.eqiad.wmnet, kubernetes1024.eqiad.wmnet, mw1464.eqiad.wmnet, kubernetes10 [09:16:53] .wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1472.eqiad.wmnet, parse1022.eqiad.wmnet, kubernetes1032.eqiad.wmnet, kubernetes1026.eqiad.wmnet, mw1409.eqiad.wmnet, mw1383.eqiad.wmnet https://wikitech.wikimedia.org/wiki/PyBal [09:17:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:18:45] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:18:53] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:19:15] FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.4069740753934648s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:19:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 8.078s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:20:57] RESOLVED: [17x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:21:09] !incidents [09:21:09] 4668 (UNACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [09:21:09] 4669 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [09:21:10] 4667 (RESOLVED) [13x] ProbeDown sre (probes/service) [09:21:43] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [09:21:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:21:50] !incidents [09:21:50] 4669 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [09:21:50] 4668 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [09:21:50] 4667 (RESOLVED) [13x] ProbeDown sre (probes/service) [09:22:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:23:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:23:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:23:30] appserver errors and response time down again. Probably some wikikube/bgp/networking issue in eqiad? [09:24:15] RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.4069740753934648s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:24:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 7.434s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:18:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:23:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:48:12] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:51:52] (CR) Tacsipacsi: Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" (1 comment) [mediawiki-config] -
https://gerrit.wikimedia.org/r/1030532 (https://phabricator.wikimedia.org/T357221) (owner: Bartosz Dziewoński) [11:30:48] (CR) Aklapper: [C:+2] Phabricator: Delete chatlog group [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1022053 (https://phabricator.wikimedia.org/T318763) (owner: Pppery) [11:31:27] (CR) Aklapper: [V:+2 C:+2] Phabricator: Delete chatlog group [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1022053 (https://phabricator.wikimedia.org/T318763) (owner: Pppery) [11:37:35] (PS1) Gergő Tisza: debug: Enable Special:WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/1030590 (https://phabricator.wikimedia.org/T350094) [12:14:39] (PS1) Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) [12:23:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:55:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T352010)', diff saved to https://phabricator.wikimedia.org/P62296 and previous config saved to /var/cache/conftool/dbconfig/20240512-125539-ladsgroup.json [12:55:43] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:10:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P62297 and previous config saved to
/var/cache/conftool/dbconfig/20240512-131046-ladsgroup.json [13:18:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:25:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P62298 and previous config saved to /var/cache/conftool/dbconfig/20240512-132554-ladsgroup.json [13:41:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T352010)', diff saved to https://phabricator.wikimedia.org/P62299 and previous config saved to /var/cache/conftool/dbconfig/20240512-134101-ladsgroup.json [13:41:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance [13:41:10] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:41:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance [13:41:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T352010)', diff saved to https://phabricator.wikimedia.org/P62300 and previous config saved to /var/cache/conftool/dbconfig/20240512-134125-ladsgroup.json [13:52:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:52:55] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:53:39] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:55:22] (PS3) MdsShakil: Allow English Wikiversity custodians to use mass-delete [mediawiki-config] - https://gerrit.wikimedia.org/r/1030313 (https://phabricator.wikimedia.org/T360977) [13:56:29] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 14 Jun 2024 01:28:50 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:56:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:56:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51924 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:36:29] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:12] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:00:14] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:11] (CR) Jdlrobson: [C:+1] Deploy disabled limited width on main page [mediawiki-config] -
https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: Kimberly Sarabia) [15:52:35] (PS12) Pppery: Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) [15:52:43] (CR) Pppery: "Really my fault for posting so many interdependent patches at once. I've adopted the approach of waiting for my patches to be merged befor" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) (owner: Pppery) [17:18:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:29:27] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 86647640 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:30:27] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 36856 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:48:12] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:06:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T352010)', diff saved to https://phabricator.wikimedia.org/P62301 and previous config saved to /var/cache/conftool/dbconfig/20240512-190629-ladsgroup.json [19:06:39] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:21:38] !log ladsgroup@cumin1002 dbctl
commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P62302 and previous config saved to /var/cache/conftool/dbconfig/20240512-192137-ladsgroup.json [19:36:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P62303 and previous config saved to /var/cache/conftool/dbconfig/20240512-193645-ladsgroup.json [19:51:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T352010)', diff saved to https://phabricator.wikimedia.org/P62304 and previous config saved to /var/cache/conftool/dbconfig/20240512-195156-ladsgroup.json [19:51:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2209.codfw.wmnet with reason: Maintenance [19:52:01] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:52:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2209.codfw.wmnet with reason: Maintenance [19:52:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T352010)', diff saved to https://phabricator.wikimedia.org/P62305 and previous config saved to /var/cache/conftool/dbconfig/20240512-195220-ladsgroup.json [21:18:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:48:12] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:28:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown