[00:01:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T402925)', diff saved to https://phabricator.wikimedia.org/P82125 and previous config saved to /var/cache/conftool/dbconfig/20250830-000136-ladsgroup.json [00:01:42] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [00:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:07:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:33] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1183205 [00:08:33] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1183205 (owner: TrainBranchBot) [00:14:35] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:16:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:16:44] !log
ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P82126 and previous config saved to /var/cache/conftool/dbconfig/20250830-001644-ladsgroup.json [00:29:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:31:10] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1183205 (owner: TrainBranchBot) [00:31:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P82127 and previous config saved to /var/cache/conftool/dbconfig/20250830-003151-ladsgroup.json [00:43:40] RESOLVED: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:46:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T402925)', diff saved to https://phabricator.wikimedia.org/P82128 and previous config saved to /var/cache/conftool/dbconfig/20250830-004659-ladsgroup.json [00:47:05] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [00:47:15] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [00:54:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes -
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:14:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [01:14:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T402925)', diff saved to https://phabricator.wikimedia.org/P82129 and previous config saved to /var/cache/conftool/dbconfig/20250830-011446-ladsgroup.json [01:14:52] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [01:21:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T402925)', diff saved to https://phabricator.wikimedia.org/P82130 and previous config saved to /var/cache/conftool/dbconfig/20250830-012158-ladsgroup.json [01:22:04] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [01:37:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P82131 and previous config saved to /var/cache/conftool/dbconfig/20250830-013705-ladsgroup.json [01:44:35] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:49:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:52:14] !log ladsgroup@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P82132 and previous config saved to /var/cache/conftool/dbconfig/20250830-015213-ladsgroup.json [02:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:07:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T402925)', diff saved to https://phabricator.wikimedia.org/P82133 and previous config saved to /var/cache/conftool/dbconfig/20250830-020720-ladsgroup.json [02:07:27] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [02:07:37] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [02:07:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1181 (T402925)', diff saved to https://phabricator.wikimedia.org/P82134 and previous config saved to /var/cache/conftool/dbconfig/20250830-020744-ladsgroup.json [02:10:33] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1155805 (https://phabricator.wikimedia.org/T396347) (owner: Huji) [02:11:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T402925)', diff saved to https://phabricator.wikimedia.org/P82135 and previous config saved to /var/cache/conftool/dbconfig/20250830-021154-ladsgroup.json [02:27:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to
https://phabricator.wikimedia.org/P82136 and previous config saved to /var/cache/conftool/dbconfig/20250830-022702-ladsgroup.json [02:42:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P82137 and previous config saved to /var/cache/conftool/dbconfig/20250830-024210-ladsgroup.json [02:51:09] (PS15) Krinkle: varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - https://gerrit.wikimedia.org/r/1180969 [02:57:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T402925)', diff saved to https://phabricator.wikimedia.org/P82138 and previous config saved to /var/cache/conftool/dbconfig/20250830-025717-ladsgroup.json [02:57:23] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [02:57:33] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [02:57:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T402925)', diff saved to https://phabricator.wikimedia.org/P82139 and previous config saved to /var/cache/conftool/dbconfig/20250830-025740-ladsgroup.json [03:02:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 -
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:13:58] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11133544 (phaultfinder) [03:15:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T402925)', diff saved to https://phabricator.wikimedia.org/P82140 and previous config saved to /var/cache/conftool/dbconfig/20250830-031553-ladsgroup.json [03:16:00] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [03:17:46] (PS1) Krinkle: varnish: Remove 60s cap for mobileaction/useformat on m-dot [puppet] - https://gerrit.wikimedia.org/r/1183212 (https://phabricator.wikimedia.org/T401595) [03:18:52] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11133548 (phaultfinder) [03:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:31:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P82141 and previous config saved to /var/cache/conftool/dbconfig/20250830-033101-ladsgroup.json [03:42:09] SRE, Traffic, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11133550 (tfmorris) >>!
In T400119#11133180, @Tgr wrote: >>>! In T400119#11133146, @tfmorris wrote: >> Note that returning a plain text error message to... [03:46:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P82142 and previous config saved to /var/cache/conftool/dbconfig/20250830-034609-ladsgroup.json [04:01:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T402925)', diff saved to https://phabricator.wikimedia.org/P82143 and previous config saved to /var/cache/conftool/dbconfig/20250830-040116-ladsgroup.json [04:01:23] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [04:01:32] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [04:01:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T402925)', diff saved to https://phabricator.wikimedia.org/P82144 and previous config saved to /var/cache/conftool/dbconfig/20250830-040139-ladsgroup.json [04:03:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T402925)', diff saved to https://phabricator.wikimedia.org/P82145 and previous config saved to /var/cache/conftool/dbconfig/20250830-040350-ladsgroup.json [04:15:45] is Phab down? 
[04:18:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P82146 and previous config saved to /var/cache/conftool/dbconfig/20250830-041858-ladsgroup.json [04:34:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P82147 and previous config saved to /var/cache/conftool/dbconfig/20250830-043406-ladsgroup.json [04:49:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T402925)', diff saved to https://phabricator.wikimedia.org/P82148 and previous config saved to /var/cache/conftool/dbconfig/20250830-044913-ladsgroup.json [04:49:19] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [04:49:29] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [04:49:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T402925)', diff saved to https://phabricator.wikimedia.org/P82149 and previous config saved to /var/cache/conftool/dbconfig/20250830-044936-ladsgroup.json [04:54:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T402925)', diff saved to https://phabricator.wikimedia.org/P82150 and previous config saved to /var/cache/conftool/dbconfig/20250830-045448-ladsgroup.json [04:54:54] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [05:09:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P82151 and previous config saved to /var/cache/conftool/dbconfig/20250830-050956-ladsgroup.json [05:21:15] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag -
https://phabricator.wikimedia.org/T402749#11133565 (Zache) >>! In T402749#11123303, @Ladsgroup wrote: > I suggest something even more radical: Move CAL (and HotCat) to core I am not against this per... [05:25:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P82152 and previous config saved to /var/cache/conftool/dbconfig/20250830-052503-ladsgroup.json [05:33:07] Tamzin: The filtering is done above Phabricator (due to Phorge's less-than-ideal anti-vandal tools) and scraping issues, so captcha and user groups are hard to bring into the logic [05:36:58] i figured something like that. still, does not seem ideal... could a more descriptive error be served at least? [05:40:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T402925)', diff saved to https://phabricator.wikimedia.org/P82153 and previous config saved to /var/cache/conftool/dbconfig/20250830-054011-ladsgroup.json [05:40:17] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [05:40:27] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [05:40:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T402925)', diff saved to https://phabricator.wikimedia.org/P82154 and previous config saved to /var/cache/conftool/dbconfig/20250830-054034-ladsgroup.json [05:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:02:01] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:02:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:05:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T402925)', diff saved to https://phabricator.wikimedia.org/P82155 and previous config saved to /var/cache/conftool/dbconfig/20250830-060459-ladsgroup.json [06:05:06] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [06:20:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P82156 and previous config saved to /var/cache/conftool/dbconfig/20250830-062007-ladsgroup.json [06:35:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P82157 and previous config saved to /var/cache/conftool/dbconfig/20250830-063515-ladsgroup.json [06:50:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T402925)', diff saved to https://phabricator.wikimedia.org/P82158 and previous config saved to /var/cache/conftool/dbconfig/20250830-065023-ladsgroup.json [06:50:29] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [06:50:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1253.eqiad.wmnet with reason: Maintenance [06:50:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T402925)', diff saved to https://phabricator.wikimedia.org/P82159 and previous config saved to /var/cache/conftool/dbconfig/20250830-065046-ladsgroup.json 
[06:53:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T402925)', diff saved to https://phabricator.wikimedia.org/P82160 and previous config saved to /var/cache/conftool/dbconfig/20250830-065358-ladsgroup.json [07:02:01] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:02:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:09:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P82161 and previous config saved to /var/cache/conftool/dbconfig/20250830-070906-ladsgroup.json [07:17:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [07:18:53] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11133623 (phaultfinder) [07:22:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [07:23:55] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11133624 (phaultfinder) [07:24:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P82162 and previous config saved to /var/cache/conftool/dbconfig/20250830-072414-ladsgroup.json [07:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:39:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T402925)', diff saved to https://phabricator.wikimedia.org/P82163 and previous config saved to /var/cache/conftool/dbconfig/20250830-073921-ladsgroup.json [07:39:27] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [07:39:37] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [07:39:46] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [07:39:53] !log
ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T402925)', diff saved to https://phabricator.wikimedia.org/P82164 and previous config saved to /var/cache/conftool/dbconfig/20250830-073953-ladsgroup.json [08:07:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T402925)', diff saved to https://phabricator.wikimedia.org/P82165 and previous config saved to /var/cache/conftool/dbconfig/20250830-080726-ladsgroup.json [08:07:32] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [08:22:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P82166 and previous config saved to /var/cache/conftool/dbconfig/20250830-082233-ladsgroup.json [08:37:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P82167 and previous config saved to /var/cache/conftool/dbconfig/20250830-083741-ladsgroup.json [08:52:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T402925)', diff saved to https://phabricator.wikimedia.org/P82168 and previous config saved to /var/cache/conftool/dbconfig/20250830-085248-ladsgroup.json [08:52:54] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [08:53:04] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [08:53:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2159 (T402925)', diff saved to https://phabricator.wikimedia.org/P82169 and previous config saved to /var/cache/conftool/dbconfig/20250830-085311-ladsgroup.json [09:02:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: ... 
[09:02:16] eqiad mw-wikifunctions releases routed via group1 (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-release=group1 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:02:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:02:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:07:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: ... 
[09:07:15] eqiad mw-wikifunctions releases routed via group1 (k8s) 2.312s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-release=group1 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:20:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T402925)', diff saved to https://phabricator.wikimedia.org/P82170 and previous config saved to /var/cache/conftool/dbconfig/20250830-092044-ladsgroup.json [09:20:51] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [09:22:08] (PS1) D3r1ck01: SUL3: Use `metawiki` as central wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1183216 (https://phabricator.wikimedia.org/T402527) [09:35:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P82171 and previous config saved to /var/cache/conftool/dbconfig/20250830-093552-ladsgroup.json [09:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:50:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P82172 and previous config saved to /var/cache/conftool/dbconfig/20250830-095059-ladsgroup.json [10:02:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:02:55] FIRING: [2x] SystemdUnitFailed:
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T402925)', diff saved to https://phabricator.wikimedia.org/P82173 and previous config saved to /var/cache/conftool/dbconfig/20250830-100606-ladsgroup.json [10:06:12] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:06:23] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [10:06:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T402925)', diff saved to https://phabricator.wikimedia.org/P82174 and previous config saved to /var/cache/conftool/dbconfig/20250830-100630-ladsgroup.json [10:26:27] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:27:27] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:32:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [10:33:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T402925)', diff saved to https://phabricator.wikimedia.org/P82175 and previous config saved to /var/cache/conftool/dbconfig/20250830-103338-ladsgroup.json [10:33:44] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:48:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P82176 and previous config saved to /var/cache/conftool/dbconfig/20250830-104845-ladsgroup.json [10:50:02] (CR) Gergő Tisza: "The dependency should go in the other direction - a config change is no-op while the config is not yet used in code, but we don't want to " [mediawiki-config] - https://gerrit.wikimedia.org/r/1183216 (https://phabricator.wikimedia.org/T402527) (owner: D3r1ck01) [11:02:33] (PS2) D3r1ck01: SUL3: Use `metawiki` as central wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1183216 (https://phabricator.wikimedia.org/T402527) [11:02:46] (CR) D3r1ck01: "Ack!
Apologies, not sure how I missed that 😞" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183216 (https://phabricator.wikimedia.org/T402527) (owner: D3r1ck01) [11:03:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P82177 and previous config saved to /var/cache/conftool/dbconfig/20250830-110353-ladsgroup.json [11:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:04:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:19:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T402925)', diff saved to https://phabricator.wikimedia.org/P82178 and previous config saved to /var/cache/conftool/dbconfig/20250830-111900-ladsgroup.json [11:19:06] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [11:19:07] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:19:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T402925)', diff saved to https://phabricator.wikimedia.org/P82179 and previous config saved to /var/cache/conftool/dbconfig/20250830-111913-ladsgroup.json [11:23:57] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit -
https://phabricator.wikimedia.org/T403273#11133717 (10phaultfinder) [11:28:56] (03CR) 10Gergő Tisza: [C:03+1] SUL3: Use `metawiki` as central wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183216 (https://phabricator.wikimedia.org/T402527) (owner: 10D3r1ck01) [11:28:57] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11133719 (10phaultfinder) [11:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:32:58] 10SRE-swift-storage, 06Commons: HTTP 404 / File not found errors for three images in one category not showing - https://phabricator.wikimedia.org/T403314#11133724 (10Aklapper) Please tag such issues with either #sre-swift-storage (file itself) or #thumbor (thumbnail only), as the #Commons community itself cann... 
[11:41:01] 10SRE-swift-storage, 06Commons: HTTP 404 / File not found errors for three images in one category - https://phabricator.wikimedia.org/T403314#11133729 (10Pigsonthewing) [11:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:47:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T402925)', diff saved to https://phabricator.wikimedia.org/P82180 and previous config saved to /var/cache/conftool/dbconfig/20250830-114726-ladsgroup.json [11:47:33] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:51:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:52:06] (03CR) 10Gergő Tisza: session: Enable MultiBackendSessionStore on `group0` wikis only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [12:02:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P82181 and previous config saved to /var/cache/conftool/dbconfig/20250830-120234-ladsgroup.json [12:17:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P82182 and previous config saved to /var/cache/conftool/dbconfig/20250830-121741-ladsgroup.json [12:32:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T402925)', diff saved to https://phabricator.wikimedia.org/P82183 and previous config 
saved to /var/cache/conftool/dbconfig/20250830-123249-ladsgroup.json [12:32:55] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [12:33:05] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance [12:57:14] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2200.codfw.wmnet with reason: Maintenance [13:06:27] (03CR) 10Krinkle: [C:03+1] CommonSettings.php: Remove old $wgCentralDBname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129230 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [13:21:41] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2208.codfw.wmnet with reason: Maintenance [13:21:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T402925)', diff saved to https://phabricator.wikimedia.org/P82185 and previous config saved to /var/cache/conftool/dbconfig/20250830-132148-ladsgroup.json [13:21:54] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [13:44:35] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:46:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T402925)', diff saved to https://phabricator.wikimedia.org/P82186 and previous config saved to /var/cache/conftool/dbconfig/20250830-134604-ladsgroup.json [13:46:10] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [14:01:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to 
https://phabricator.wikimedia.org/P82187 and previous config saved to /var/cache/conftool/dbconfig/20250830-140112-ladsgroup.json [14:03:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P82188 and previous config saved to /var/cache/conftool/dbconfig/20250830-141619-ladsgroup.json [14:29:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:31:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T402925)', diff saved to https://phabricator.wikimedia.org/P82189 and previous config saved to /var/cache/conftool/dbconfig/20250830-143127-ladsgroup.json [14:31:33] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [14:31:43] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance [14:31:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2218 (T402925)', diff saved to https://phabricator.wikimedia.org/P82190 and previous config saved to /var/cache/conftool/dbconfig/20250830-143150-ladsgroup.json [14:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [14:56:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T402925)', diff saved to https://phabricator.wikimedia.org/P82191 and previous config saved to /var/cache/conftool/dbconfig/20250830-145606-ladsgroup.json [14:56:12] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [15:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:04:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2218', diff saved to https://phabricator.wikimedia.org/P82192 and previous config saved to /var/cache/conftool/dbconfig/20250830-151113-ladsgroup.json [15:21:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:26:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P82193 and previous config saved to /var/cache/conftool/dbconfig/20250830-152621-ladsgroup.json [15:28:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:28:57] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11133869 (10phaultfinder) [15:29:36] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:33:53] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11133874 (10phaultfinder) [15:41:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T402925)', diff saved to https://phabricator.wikimedia.org/P82194 and previous config saved to /var/cache/conftool/dbconfig/20250830-154128-ladsgroup.json [15:41:34] T402925: 
Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [15:41:44] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2221.codfw.wmnet with reason: Maintenance [15:41:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2221 (T402925)', diff saved to https://phabricator.wikimedia.org/P82195 and previous config saved to /var/cache/conftool/dbconfig/20250830-154151-ladsgroup.json [15:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:56:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:05:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:06:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T402925)', diff 
saved to https://phabricator.wikimedia.org/P82196 and previous config saved to /var/cache/conftool/dbconfig/20250830-160603-ladsgroup.json [16:06:10] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [16:19:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:21:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P82197 and previous config saved to /var/cache/conftool/dbconfig/20250830-162111-ladsgroup.json [16:36:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P82198 and previous config saved to /var/cache/conftool/dbconfig/20250830-163619-ladsgroup.json [16:51:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T402925)', diff saved to https://phabricator.wikimedia.org/P82199 and previous config saved to /var/cache/conftool/dbconfig/20250830-165126-ladsgroup.json [16:51:32] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [16:51:42] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2222.codfw.wmnet with reason: Maintenance [16:51:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2222 (T402925)', diff saved to https://phabricator.wikimedia.org/P82200 and previous config saved to /var/cache/conftool/dbconfig/20250830-165149-ladsgroup.json [17:16:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T402925)', diff saved to 
https://phabricator.wikimedia.org/P82201 and previous config saved to /var/cache/conftool/dbconfig/20250830-171605-ladsgroup.json [17:16:11] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [17:16:26] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:21:26] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:31:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P82202 and previous config saved to /var/cache/conftool/dbconfig/20250830-173113-ladsgroup.json [17:42:33] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11134002 (10DavidBrooks) Re-upping a question I had earlier - will the servers' "Retry-After" header use seconds, or http-date, or potentially either? Of c... 
[17:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:46:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P82203 and previous config saved to /var/cache/conftool/dbconfig/20250830-174620-ladsgroup.json [18:01:11] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11134004 (10Vgutierrez) >>! In T400119#11134002, @DavidBrooks wrote: > Re-upping a question I had earlier - will the servers' "Retry-After" header use seco... [18:01:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T402925)', diff saved to https://phabricator.wikimedia.org/P82204 and previous config saved to /var/cache/conftool/dbconfig/20250830-180128-ladsgroup.json [18:01:34] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [18:03:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:31:05] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [18:31:12] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:31:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T402925)', diff saved to 
https://phabricator.wikimedia.org/P82205 and previous config saved to /var/cache/conftool/dbconfig/20250830-183119-ladsgroup.json [18:31:25] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [18:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [19:02:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T402925)', diff saved to https://phabricator.wikimedia.org/P82206 and previous config saved to /var/cache/conftool/dbconfig/20250830-190228-ladsgroup.json [19:02:35] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [19:04:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:17:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P82207 and previous config saved to /var/cache/conftool/dbconfig/20250830-191736-ladsgroup.json [19:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:32:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P82208 and previous config saved to /var/cache/conftool/dbconfig/20250830-193244-ladsgroup.json [19:33:53] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11134047 (10phaultfinder) [19:38:54] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11134059 (10phaultfinder) [19:47:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T402925)', diff saved to https://phabricator.wikimedia.org/P82209 and previous config saved to /var/cache/conftool/dbconfig/20250830-194751-ladsgroup.json [19:47:57] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [19:48:07] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [19:48:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T402925)', diff saved to https://phabricator.wikimedia.org/P82210 and previous config saved to /var/cache/conftool/dbconfig/20250830-194814-ladsgroup.json [19:49:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:50:26] !log ladsgroup@cumin1003 dbctl 
commit (dc=all): 'Repooling after maintenance db1162 (T402925)', diff saved to https://phabricator.wikimedia.org/P82211 and previous config saved to /var/cache/conftool/dbconfig/20250830-195026-ladsgroup.json [20:05:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P82212 and previous config saved to /var/cache/conftool/dbconfig/20250830-200533-ladsgroup.json [20:07:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:12:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:20:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P82213 and previous config saved to /var/cache/conftool/dbconfig/20250830-202041-ladsgroup.json [20:26:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:35:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T402925)', diff saved to https://phabricator.wikimedia.org/P82214 and previous config saved to 
/var/cache/conftool/dbconfig/20250830-203548-ladsgroup.json [20:35:54] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [20:36:04] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [20:36:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T402925)', diff saved to https://phabricator.wikimedia.org/P82215 and previous config saved to /var/cache/conftool/dbconfig/20250830-203611-ladsgroup.json [21:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:05:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T402925)', diff saved to https://phabricator.wikimedia.org/P82216 and previous config saved to /var/cache/conftool/dbconfig/20250830-210554-ladsgroup.json [21:06:01] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [21:19:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:21:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P82217 and previous config saved to /var/cache/conftool/dbconfig/20250830-212101-ladsgroup.json [21:34:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on 
k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:36:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P82218 and previous config saved to /var/cache/conftool/dbconfig/20250830-213609-ladsgroup.json [21:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:47:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:51:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T402925)', diff saved to https://phabricator.wikimedia.org/P82219 and previous config saved to /var/cache/conftool/dbconfig/20250830-215116-ladsgroup.json [21:51:22] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [21:51:31] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [21:51:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1188 (T402925)', diff saved to https://phabricator.wikimedia.org/P82220 and previous config saved to /var/cache/conftool/dbconfig/20250830-215138-ladsgroup.json [21:53:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T402925)', diff saved 
to https://phabricator.wikimedia.org/P82221 and previous config saved to /var/cache/conftool/dbconfig/20250830-215351-ladsgroup.json [22:02:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:03:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:08:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P82222 and previous config saved to /var/cache/conftool/dbconfig/20250830-220859-ladsgroup.json [22:24:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P82223 and previous config saved to /var/cache/conftool/dbconfig/20250830-222406-ladsgroup.json [22:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [22:33:03] PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:34:03] RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:39:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 
(T402925)', diff saved to https://phabricator.wikimedia.org/P82224 and previous config saved to /var/cache/conftool/dbconfig/20250830-223914-ladsgroup.json [22:39:20] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [22:39:29] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [22:39:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T402925)', diff saved to https://phabricator.wikimedia.org/P82225 and previous config saved to /var/cache/conftool/dbconfig/20250830-223936-ladsgroup.json [22:41:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T402925)', diff saved to https://phabricator.wikimedia.org/P82226 and previous config saved to /var/cache/conftool/dbconfig/20250830-224149-ladsgroup.json [22:56:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P82227 and previous config saved to /var/cache/conftool/dbconfig/20250830-225656-ladsgroup.json [23:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:04:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:12:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to 
https://phabricator.wikimedia.org/P82228 and previous config saved to /var/cache/conftool/dbconfig/20250830-231204-ladsgroup.json [23:27:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T402925)', diff saved to https://phabricator.wikimedia.org/P82229 and previous config saved to /var/cache/conftool/dbconfig/20250830-232712-ladsgroup.json [23:27:18] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [23:27:27] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [23:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:38:05] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1183241 [23:38:05] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1183241 (owner: TrainBranchBot) [23:38:55] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11134216 (phaultfinder) [23:43:59] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11134217 (phaultfinder) [23:51:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1183241 (owner: TrainBranchBot) [23:55:15] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: 
Maintenance [23:55:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1229 (T402925)', diff saved to https://phabricator.wikimedia.org/P82230 and previous config saved to /var/cache/conftool/dbconfig/20250830-235521-ladsgroup.json [23:55:28] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [23:57:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T402925)', diff saved to https://phabricator.wikimedia.org/P82231 and previous config saved to /var/cache/conftool/dbconfig/20250830-235735-ladsgroup.json