[00:00:10] !log bblack@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload and not P{cp7008.magru.wmnet} and A:cp - Upgrade wmfuniq to 0.3.0 () [00:00:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:02:51] RESOLVED: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from gerrit.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [00:09:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:14:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:17:45] (PS1) Ryan Kemper: dse-k8s: bump opensearch-semantic-search mem quota [deployment-charts] - https://gerrit.wikimedia.org/r/1300945 (https://phabricator.wikimedia.org/T426589) [00:19:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:24:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:47:15] FIRING:
MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:52:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:58:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:03:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:04:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:08:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:14:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:19:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:24:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:33:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:38:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:39:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:43:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:46:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:49:15] FIRING: 
MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:54:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:58:25] !incidents [01:58:25] 8071 (RESOLVED) [6x] ATSBackendErrorsHigh cache_text sre (gerrit.discovery.wmnet) [02:00:20] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:01:34] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 01m 13s) [02:03:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:08:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:13:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:23:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:27:15] FIRING: 
MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:37:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:43:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [02:46:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:48:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:53:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:58:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:04:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:06:24] !log T427951 Deleted all 20 unused dev/test topics on kafka-jumbo (verified empty first); 2 (`[eqiad,codfw]page_html_content_change.rc0`) were immediately auto-recreated empty by a still-running `dse-k8s` enrichment consumer; awaiting owner confirmation before final re-delete [03:06:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:29] T427951: Delete some unused development topics on Kafka Jumbo - https://phabricator.wikimedia.org/T427951 [03:07:13] !log T427951 sorry, `[eqiad,codfw].mediawiki.page_html_content_change.rc0` (accidentally a word) [03:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:13:44] PROBLEM - MariaDB Replica Lag: m2 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 615.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [03:14:44] RECOVERY - MariaDB Replica Lag: m2 on db2160 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [03:17:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:22:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:27:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:37:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:38:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate 
of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:42:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:46:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:51:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:56:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:01:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:04:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:09:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:14:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:19:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:24:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:42:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:52:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:57:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:57:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:02:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:07:30] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:12:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:12:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:22:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:22:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:27:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:31:35] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12013111 (Marostegui) I am going to leave the host depooled and downtimed for the weekend and will repool on Monday.
[05:32:33] (PS1) Marostegui: Revert "installserver: Format es2045 entirely" [puppet] - https://gerrit.wikimedia.org/r/1300952 [05:36:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:36:44] (CR) Marostegui: [C:+2] Revert "installserver: Format es2045 entirely" [puppet] - https://gerrit.wikimedia.org/r/1300952 (owner: Marostegui) [05:41:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:47:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:54:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:56:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:59:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext -
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260612T0600) [06:02:40] (PS1) Muehlenhoff: sre.puppet.disable-merges: Remove unused code snippet [cookbooks] - https://gerrit.wikimedia.org/r/1300954 [06:02:49] (CR) CI reject: [V:-1] sre.puppet.disable-merges: Remove unused code snippet [cookbooks] - https://gerrit.wikimedia.org/r/1300954 (owner: Muehlenhoff) [06:04:37] (PS2) Muehlenhoff: sre.puppet.disable-merges: Remove unused code snippet [cookbooks] - https://gerrit.wikimedia.org/r/1300954 [06:05:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:08:50] (CR) Muehlenhoff: [C:+2] sre.puppet.disable-merges: Remove unused code snippet [cookbooks] - https://gerrit.wikimedia.org/r/1300954 (owner: Muehlenhoff) [06:11:56] !log jmm@cumin2002 START - Cookbook sre.puppet.disable-merges [06:13:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.disable-merges (exit_code=0) [06:13:31] (CR) Muehlenhoff: [C:+2] aptrepo: Add Routinator for trixie [puppet] - https://gerrit.wikimedia.org/r/1294933 (owner: Muehlenhoff) [06:14:48] (CR) Marostegui: [C:+1] profile::mariadb::beta: Initialize system schema on fresh hosts [puppet] - https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) (owner: Ahmon Dancy) [06:15:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:18:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors -
kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:23:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:32:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:41:28] (CR) Muehlenhoff: [C:+2] mirrors: Disable osbpo sync [puppet] - https://gerrit.wikimedia.org/r/1294980 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff) [06:42:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:45:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node
dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [06:46:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:48:02] SRE-swift-storage, EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#12013192 (Hakimi97) Also happen on Malay Wikipedia, see https://ms.wikipedia.org/w/index.php?title=Templat:Statistik_Penyelia_dan_Birokrat&oldid=6872043 [06:52:24] (PS1) Muehlenhoff: mirrors: Removing remaining bits of osbpo mirror [puppet] - https://gerrit.wikimedia.org/r/1301232 (https://phabricator.wikimedia.org/T416707) [06:53:02] (PS1) Arnaudb: gerrit: add 5xx and 4xx alert thresholds [alerts] - https://gerrit.wikimedia.org/r/1301233 (https://phabricator.wikimedia.org/T428979) [06:54:37] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1301232 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff) [06:56:27] (CR) Muehlenhoff: [C:+1] "Looks good" [debs/wmf-laptop] - https://gerrit.wikimedia.org/r/1300101 (https://phabricator.wikimedia.org/T425441) (owner: Arnaudb) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260612T0700) [07:00:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:01:41] (PS1) Tiziano Fogli: netops/iface/saturation: strip cableid [alerts] - https://gerrit.wikimedia.org/r/1301236 (https://phabricator.wikimedia.org/T424794) [07:04:35] (CR) Muehlenhoff: [C:+2] mirrors: Removing remaining bits of osbpo mirror [puppet] - https://gerrit.wikimedia.org/r/1301232 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff) [07:09:57] (CR) Marostegui: [C:+2] Revert "installserver: Format es2045 entirely" [puppet] - https://gerrit.wikimedia.org/r/1300728 (owner: Marostegui) [07:10:20] (PS2) Marostegui: Revert "installserver: Format es2045 entirely" [puppet] - https://gerrit.wikimedia.org/r/1300728 [07:11:15] (Abandoned) Marostegui: Revert "installserver: Format es2045 entirely" [puppet] - https://gerrit.wikimedia.org/r/1300728 (owner: Marostegui) [07:16:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:18:09] (PS1) Muehlenhoff: Stop monitoring the Ubuntu part of our mirror [alerts] - https://gerrit.wikimedia.org/r/1301237 (https://phabricator.wikimedia.org/T416707) [07:20:28] (PS1) Muehlenhoff: mirrors: Disable the NRPE check for the Ubuntu mirror [puppet] - https://gerrit.wikimedia.org/r/1301238 (https://phabricator.wikimedia.org/T416707) [07:20:45] (CR) CI reject: [V:-1] Stop monitoring the Ubuntu part of our mirror [alerts] -
https://gerrit.wikimedia.org/r/1301237 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff) [07:21:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:24:19] (PS2) Muehlenhoff: Stop monitoring the Ubuntu part of our mirror [alerts] - https://gerrit.wikimedia.org/r/1301237 (https://phabricator.wikimedia.org/T416707) [07:26:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:28:23] (CR) KineticPelagic: [C:+1] "Looks great. What an uplifting feeling to remove unnecessary code, too." [mediawiki-config] - https://gerrit.wikimedia.org/r/1300245 (https://phabricator.wikimedia.org/T422756) (owner: BPirkle) [07:31:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:36:22] (PS1) Tiziano Fogli: alertmanager-irc-relay: suppress alertname on irc [puppet] - https://gerrit.wikimedia.org/r/1301240 (https://phabricator.wikimedia.org/T424794) [07:36:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:36:33] (PS1) Tiziano Fogli: netops/iface/saturation: suppress alertname on irc [alerts] - https://gerrit.wikimedia.org/r/1301241 (https://phabricator.wikimedia.org/T424794) [07:38:54] (CR) Muehlenhoff: [C:+2] Stop monitoring the Ubuntu part of our mirror [alerts] - https://gerrit.wikimedia.org/r/1301237
(https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff) [07:41:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:46:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:47:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:51:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:14:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:19:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:31:23] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1301300 [08:31:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core]
(wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1301300 (owner: 10TrainBranchBot) [08:31:32] (03PS1) 10Trueg: Added DNS entries for the new WDQS 2 deployments in DSE K8s. [dns] - 10https://gerrit.wikimedia.org/r/1301301 (https://phabricator.wikimedia.org/T428925) [08:34:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:39:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:41:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:42:00] (03PS1) 10Brouberol: global_config: add external-services for fr-tech minio endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1301303 (https://phabricator.wikimedia.org/T428294) [08:44:34] (03PS4) 10Cathal Mooney: Interface ACL attachment - base on description not static yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1300900 (https://phabricator.wikimedia.org/T428886) [08:44:39] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1301300 (owner: 10TrainBranchBot) [08:45:57] (03PS1) 10Brouberol: airflow-fr-tech: allow egress of task pods to the fr-tech s3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301305 (https://phabricator.wikimedia.org/T428294) [08:51:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate 
[08:54:14] (03CR) 10Hnowlan: [C:03+1] alertmanager-irc-relay: suppress alertname on irc [puppet] - 10https://gerrit.wikimedia.org/r/1301240 (https://phabricator.wikimedia.org/T424794) (owner: 10Tiziano Fogli) [08:55:55] what's going on with kube-mw-api-ext? [08:56:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:58:28] lots of 'PHP Warning: Undefined array key "C"' [09:04:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:06:03] hnowlan: seems to have started yesterday 17:24:15 ? [09:07:28] (03PS1) 10Muehlenhoff: Add cumin2003 in firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1301309 (https://phabricator.wikimedia.org/T427897) [09:08:22] (03CR) 10Muehlenhoff: [C:03+2] mirrors: Disable the NRPE check for the Ubuntu mirror [puppet] - 10https://gerrit.wikimedia.org/r/1301238 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [09:08:28] hnowlan: seems to be wikibase related? 
[09:08:52] /w/api.php TypeError: Wikibase\Client\Usage\UsageDeduplicator::deduplicateStatementUsages(): Argument #1 ($statementUsages) must be of type array, null given, called in /srv/mediawiki/php-1.47.0-wmf.6/extensions/Wikibase/client/includes/Usage/UsageDeduplicator.php on line 73 [09:09:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:12:14] (03PS2) 10Arthur taylor: Add instance-of WikiProject links for paintings and elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299943 (https://phabricator.wikimedia.org/T422936) (owner: 10Sadiya.mohammed13) [09:14:15] (03CR) 10Arthur taylor: [C:03+1] "Looks good. Can go out when the announcement has been made and enough time has passed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299943 (https://phabricator.wikimedia.org/T422936) (owner: 10Sadiya.mohammed13) [09:18:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:19:54] (03PS1) 10Blake: mw-wikifunctions: Prune host list for mw-wikifunctions ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301313 (https://phabricator.wikimedia.org/T427668) [09:23:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:23:38] (03PS1) 10Jcrespo: Revert^2 "dbbackups: Testing x1 backups on new cumin2003 trixie host" [puppet] - 10https://gerrit.wikimedia.org/r/1301314 
(https://phabricator.wikimedia.org/T427897) [09:23:57] (03PS2) 10Jcrespo: Revert^2 "dbbackups: Testing x1 backups on new cumin2003 trixie host" [puppet] - 10https://gerrit.wikimedia.org/r/1301314 (https://phabricator.wikimedia.org/T427897) [09:23:59] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1301314 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [09:31:59] 06SRE, 10SRE-SLO, 10Observability-Metrics: Rework the Pyrra list dashboard - https://phabricator.wikimedia.org/T394415#12013634 (10hnowlan) Should this be closed or is there sloth-adjacent work that should be tracked here? [09:32:10] 10SRE-SLO: The Pyrra SLO Duration panel is broken when the latency metric is in milliseconds - https://phabricator.wikimedia.org/T400724#12013636 (10hnowlan) 05Open→03Resolved a:03hnowlan Sloth has replaced Pyrra [09:33:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:33:18] (03CR) 10Federico Ceratto: "I tested it again during the switchover on yesterday. The cookbook ran OK in itself but the scripts rejected the dbctl status (I suspect t" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [09:34:34] 10SRE-SLO: Pyrra calculations for the Initial error budget value of calendar windows - https://phabricator.wikimedia.org/T403729#12013641 (10hnowlan) 05Open→03Declined Pyrra has been replaced by Sloth [09:34:35] (03CR) 10Federico Ceratto: "Do we want to start using this with `test-cookbooks` during regular pool/depools?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1295480 (https://phabricator.wikimedia.org/T422361) (owner: 10Federico Ceratto) [09:35:15] (03CR) 10Federico Ceratto: "Ready for review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [09:36:21] (03PS1) 10Muehlenhoff: Add cumin2003 to DB firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1301323 (https://phabricator.wikimedia.org/T427897) [09:36:24] (03PS1) 10Muehlenhoff: Add mysql grant for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1301324 (https://phabricator.wikimedia.org/T427897) [09:38:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:39:29] (03PS16) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [09:40:00] (03PS2) 10Federico Ceratto: sre.mysql.upgrade: add basic functional test [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) [09:40:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:43:06] (03CR) 10CI reject: [V:04-1] sre.mysql.upgrade: add basic functional test [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [09:45:57] (03CR) 10Brouberol: "@astein@wikimedia.org Would you be able (in the future) to define a DNS record that would point to the IPs for both FQDN, and be stable ov" [puppet] - 10https://gerrit.wikimedia.org/r/1301303 (https://phabricator.wikimedia.org/T428294) (owner: 10Brouberol) [09:46:49] (03PS1) 10Jcrespo: admin: Remove file-based key and add new yubikey-based key [puppet] - 10https://gerrit.wikimedia.org/r/1301327 [09:47:04] FIRING: HelmReleaseBadStatus: Helm release 
wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:47:34] (03PS1) 10Muehlenhoff: ganeti: Grant RAPI access to cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1301328 (https://phabricator.wikimedia.org/T427897) [09:48:09] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12013687 (10MoritzMuehlenhoff) [09:48:46] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1301327 (owner: 10Jcrespo) [09:49:11] (03CR) 10Jcrespo: [C:03+2] Revert^2 "dbbackups: Testing x1 backups on new cumin2003 trixie host" [puppet] - 10https://gerrit.wikimedia.org/r/1301314 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [09:49:50] (03CR) 10Gkyziridis: [C:03+2] ml-services: add liftwing-openapi-server latest version deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300806 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [09:50:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:51:08] (03PS2) 10Muehlenhoff: Retire the Ubuntu mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294284 (https://phabricator.wikimedia.org/T416707) [09:51:53] (03PS3) 10Federico Ceratto: sre.mysql.upgrade: add basic functional test [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) [09:52:09] (03Merged) 10jenkins-bot: ml-services: add liftwing-openapi-server latest version deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300806 
(https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [09:53:36] 06SRE, 10Observability-Metrics, 10Prod-Kubernetes, 06ServiceOps new, 06SRE Observability (FY2025/2026-Q3): write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#12013706 (10MLechvien-WMF) @hnowlan do you know if this is complete? [09:54:06] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10Thumbor: old file revisions missing of File:A_Warm_Shade_of_Ivory_-_Henry_Mancini_album_cover.jpg - https://phabricator.wikimedia.org/T428406#12013707 (10TheDJ) [09:55:16] (03PS1) 10Muehlenhoff: Add cumin2003 as additional git peer for Homer [puppet] - 10https://gerrit.wikimedia.org/r/1301330 (https://phabricator.wikimedia.org/T427897) [09:57:26] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'liftwing-openapi-server' for release 'main' . [09:57:58] (03PS1) 10Muehlenhoff: Also sync firmwares to cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1301331 (https://phabricator.wikimedia.org/T427897) [09:58:36] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'liftwing-openapi-server' for release 'main' . [09:59:02] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'liftwing-openapi-server' for release 'main' . [09:59:02] (03CR) 10Federico Ceratto: "Rebased and ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [10:00:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1301331 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [10:02:57] (03CR) 10Federico Ceratto: [C:03+1] "The grant matches the one for cumin2002. The ipaddr matches cumin2003." 
[puppet] - 10https://gerrit.wikimedia.org/r/1301324 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [10:03:47] (03CR) 10Federico Ceratto: [C:03+1] "The ipaddr matches the hostname." [puppet] - 10https://gerrit.wikimedia.org/r/1301323 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [10:05:36] 06SRE, 10Observability-Metrics, 10Prod-Kubernetes, 06ServiceOps new, 06SRE Observability (FY2025/2026-Q3): write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#12013734 (10hnowlan) This is in progress, should be complete next week. [10:08:49] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [10:12:32] !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply [10:12:42] !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply [10:22:49] RECOVERY - MariaDB read only s1 on db1163 is OK: Version 10.11.14-MariaDB-log, Uptime 19541100s, read_only: True, event_scheduler: True, 8203.14 QPS, connection latency: 0.025644s, query latency: 0.000557s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:23:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:28:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:33:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:34:08] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1301327 (owner: 10Jcrespo) [10:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:35:17] (03CR) 10Jcrespo: [C:03+2] admin: Remove file-based key and add new yubikey-based key [puppet] - 10https://gerrit.wikimedia.org/r/1301327 (owner: 10Jcrespo) [10:35:31] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:35:33] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:35:57] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:36:15] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:37:51] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:38:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:40:11] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:42:57] (03CR) 10Atsuko: [C:03+2] toolhub: switch staging to test opensearch cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300795 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [10:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [10:45:16] (03Merged) 10jenkins-bot: toolhub: switch staging to test opensearch cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300795 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [10:46:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:48:13] ^^ this is just the silence expiring I think [10:49:05] (03PS1) 10Jcrespo: Revert^3 "dbbackups: Testing x1 backups on new cumin2003 trixie host" [puppet] - 10https://gerrit.wikimedia.org/r/1301336 [10:49:13] !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply [10:49:21] !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply [10:52:17] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#12013829 (10jcrespo) @elukey Cumin is still not in a working state, see below: ` root@cumin2003:~$ remote-backup-mariadb x1 [09:56:37]: INFO - Create a new empty directory a... [10:53:42] (03CR) 10Jcrespo: [C:03+2] Revert^3 "dbbackups: Testing x1 backups on new cumin2003 trixie host" [puppet] - 10https://gerrit.wikimedia.org/r/1301336 (owner: 10Jcrespo) [10:54:59] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#12013838 (10MoritzMuehlenhoff) >>! 
In T427897#12013829, @jcrespo wrote: > @elukey Cumin is still not in a working state, see below: That is expected and will start working once... [10:55:58] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [10:56:06] !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply [10:56:15] !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply [10:56:33] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#12013842 (10jcrespo) No worries, just indicating why I cannot migrate the backup jobs yet, will do no problem as soon as remote execution is working! [10:59:21] (03PS1) 10Blake: mediawiki: Use utf-8 for text/plain and text/html. [puppet] - 10https://gerrit.wikimedia.org/r/1301338 (https://phabricator.wikimedia.org/T428772) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260612T0700) [11:00:05] jelto, arnoldokoth, mutante, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260612T1100). 
[11:00:40] (03PS1) 10Atsuko: toolhub: listener config for toolhub-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301340 (https://phabricator.wikimedia.org/T426073) [11:01:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:02:11] (03PS3) 10Federico Ceratto: Link to cookbook doc [cookbooks] - 10https://gerrit.wikimedia.org/r/1301339 [11:03:41] (03PS1) 10Clément Goubert: Close API Portal wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) [11:06:31] (03CR) 10Brouberol: [C:03+1] toolhub: listener config for toolhub-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301340 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [11:07:24] (03CR) 10Clément Goubert: [C:03+1] mediawiki: Use utf-8 for text/plain and text/html. [puppet] - 10https://gerrit.wikimedia.org/r/1301338 (https://phabricator.wikimedia.org/T428772) (owner: 10Blake) [11:07:48] (03CR) 10Atsuko: [C:03+2] toolhub: listener config for toolhub-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301340 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [11:07:51] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:08:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:09:43] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade. 
[11:10:05] (03Merged) 10jenkins-bot: toolhub: listener config for toolhub-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301340 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [11:10:16] (03CR) 10Atsuko: [C:03+1] airflow-fr-tech: allow egress of task pods to the fr-tech s3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301305 (https://phabricator.wikimedia.org/T428294) (owner: 10Brouberol) [11:11:48] (03CR) 10Atsuko: [C:03+1] global_config: add external-services for fr-tech minio endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1301303 (https://phabricator.wikimedia.org/T428294) (owner: 10Brouberol) [11:11:59] 06SRE, 10Cloud-Services, 06Infrastructure-Foundations, 10netops: Upgrade cloudsw1-e4-eqiad - https://phabricator.wikimedia.org/T429013 (10cmooney) 03NEW p:05Triage→03Medium The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/pr... [11:12:07] 06SRE, 10Cloud-Services, 06Infrastructure-Foundations, 10netops: Upgrade cloudsw1-e4-eqiad - https://phabricator.wikimedia.org/T429013#12013913 (10cmooney) [11:12:43] jelto@cumin1003 upgrade (PID 3467633) is awaiting input [11:13:08] (03CR) 10Brouberol: [C:03+2] global_config: add external-services for fr-tech minio endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1301303 (https://phabricator.wikimedia.org/T428294) (owner: 10Brouberol) [11:14:09] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade cloudsw1-e4-eqiad - https://phabricator.wikimedia.org/T429013#12013917 (10cmooney) [11:14:42] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T429014 (10cmooney) 03NEW p:05Triage→03Medium [11:14:53] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T429014#12013933 (10cmooney) [11:15:20] (03CR) 10Brouberol: [C:03+2] airflow-fr-tech: allow egress of task pods to the fr-tech s3 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1301305 (https://phabricator.wikimedia.org/T428294) (owner: 10Brouberol) [11:15:44] (03CR) 10Jforrester: "Stupid question: mw-wikifunctions serves www.wikifunctions.org to external users, but the (?) same pods as mw-wikifunctions-ro also servic" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301313 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [11:17:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:19:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:21:17] (03CR) 10CWilliams: "@fceratto@wikimedia.org please can you be more specific here? e.g. 
the line(s) of code that you are referring to, or a paste of the issue " [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:22:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:23:21] 06SRE, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops, 06tools-infrastructure-team: Upgrade cloudsw1-e4-eqiad - https://phabricator.wikimedia.org/T429013#12013945 (10cmooney) [11:23:38] 06SRE, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops, 06tools-infrastructure-team: Upgrade cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T429014#12013946 (10cmooney) [11:24:37] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [11:27:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:32:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:34:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:35:01] (03CR) 10CWilliams: [C:04-1] "I think that this should have maintained the inheritance from Pool. It would seem to be trivial to have done that rather than to duplicate" [cookbooks] - 10https://gerrit.wikimedia.org/r/1295480 (https://phabricator.wikimedia.org/T422361) (owner: 10Federico Ceratto) [11:35:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:35:56] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [11:36:18] 06SRE, 10DNS: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12013975 (10cmooney) Thanks @CDanis Regarding the SERVFAIL itself I think the reason that happens is that coredns is configured to "fallthrough" if it gets a PTR query for a record it doesn't have. `... [11:36:50] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:39:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:40:53] !log installing Linux 5.10.257 on Bullseye hosts [11:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:27] 06SRE, 10DNS, 07Kubernetes: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12014017 (10cmooney) [11:44:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:47:07] (03PS1) 10Sergio Gimeno: Remove no longer used eventlogging_HomepageModule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301349 (https://phabricator.wikimedia.org/T426742) [11:49:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:49:49] (03CR) 10CWilliams: cookbooks/sre/mysql/decommission: add cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 
(https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [11:50:25] (03PS3) 10Phuedx: Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) (owner: 10Clare Ming) [11:50:48] (03PS4) 10Phuedx: Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) (owner: 10Clare Ming) [11:54:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:59:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:01:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus5003.eqsin.wmnet to drbd [12:02:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of prometheus5003.eqsin.wmnet to drbd [12:03:36] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:04:09] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:04:10] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [12:04:41] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [12:09:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:10:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:10:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-fr-tech: apply [12:10:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-fr-tech: apply [12:14:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:17:17] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [12:19:50] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:22:06] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:22:09] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:22:10] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [12:22:13] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [12:22:27] PROBLEM - Host prometheus5003 is DOWN: PING CRITICAL - Packet loss = 100% [12:22:30] got paged [12:22:46] !ack [12:22:46] 8072 (ACKED) Alertmanager-passive: Prometheus Watchdog Alerts Presence check is still DOWN [12:22:46] 8073 (ACKED) Alertmanager-active: Prometheus Watchdog Alerts Presence check is still DOWN [12:22:48] metamonitoring [12:23:11] moritzm: that's expected right? 
[12:23:22] work on the ganeti hosts in the same dc [12:23:46] I guess missed downtime [12:23:50] on the metamonitoring side [12:24:48] volans: yes, downtiming has not been implemented yet.. [12:24:51] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12014136 (10MoritzMuehlenhoff) [12:24:57] :( [12:25:11] tappof: so we just ignore it for a bit? [12:26:36] Raine: if we're aware of any ganeti-related activities, then yes... [12:26:42] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [12:26:47] ack, thanks [12:26:51] I got that from -observability [12:27:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:28:48] my bad... I hadn't seen the notification previously... thanks, volans. [12:31:30] tappof: is it expected to page every X minutes? [12:31:32] paged again [12:31:59] !ack [12:31:59] All incidents are already acked. [12:32:03] ah the passive one [12:32:31] so far we got 3 pages for a single host down... not sure if expected [12:32:42] seems a bit overzealous [12:32:47] looking [12:32:58] filed T429020 for downtiming the alerts [12:32:59] it initiated the command, but something failed [12:32:59] T429020: Add downtime support to metamonitoring alerts - https://phabricator.wikimedia.org/T429020 [12:33:45] it's starting again [12:33:50] volans: The alerts cover different scenarios... with the host down, they all get triggered.. 
[12:34:22] RECOVERY - Host prometheus5003 is UP: PING OK - Packet loss = 0%, RTA = 232.16 ms [12:34:30] the plain->DRBD migration failed due to some internal consistency check and that left the VM still shut down [12:34:41] I've restarted it manually for now, sorry for the noise [12:34:54] will dig into why that failed and not re-attempt today [12:34:59] but is it still with plain disk right now or back to drbd? [12:35:14] volans: I'll think about how to make it less noisy.. [12:35:24] volans: in plain, as before [12:35:31] ack thx both [12:35:48] DRBD failed to allocate a secondary with a cryptic error, need to dig into what that actually means [12:36:43] (03CR) 10Federico Ceratto: sre.mysql: split pool/depool (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1295480 (https://phabricator.wikimedia.org/T422361) (owner: 10Federico Ceratto) [12:37:52] moritzm: I can put the checks in maintenance mode for a while if you need to do any more reboots (cc volans Raine) [12:38:18] tappof: nah, not today. 
I'll check in more depth and will sync up when I re-try next week [12:38:29] ack moritzm thx [12:38:34] the underlying problem is that eqsin still has the old Ganeti server hardware with just 400G effective space for VMs and we've grown the number of VMs in the edges (and the amount of disk space they need) [12:38:54] to an extent where fully dropping one node for a reimage (as for the switch renewal prep work this week) [12:39:15] ack [12:39:33] requires such hacks as temporarily moving the 160G prometheus instance (which needs this twice given DRBD) away from DRBD [12:39:44] (03CR) 10Milazg: [C:03+1] REST: set new RestModuleOverrides variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300245 (https://phabricator.wikimedia.org/T422756) (owner: 10BPirkle) [12:39:47] (03CR) 10Btullis: [C:03+1] airflow: export the CLASSPATH environment variable into the task-pod shell [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299527 (https://phabricator.wikimedia.org/T428099) (owner: 10Brouberol) [12:40:00] fortunately these are being refreshed next FY to the current hardware type where we have almost twice the disk space [12:41:16] (03Abandoned) 10Brouberol: airflow: export the CLASSPATH environment variable into the task-pod shell [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299527 (https://phabricator.wikimedia.org/T428099) (owner: 10Brouberol) [12:42:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:42:18] tappof: AFAICT the metamonitoring has recovered, but it doesn't seem to resolve the alerts on the paging side [12:43:00] spoke too soon, they are gone now [12:43:02] my bad :D [12:43:10] (03PS1) 10Slyngshede: Re-enable WebAuthN [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1301355 [12:43:24] volans: :) [12:44:15] FIRING: [2x] 
MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:44:53] (03PS9) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) [12:44:59] (03CR) 10Blake: "I suspect not, if that's internal traffic, but Scott will have a more authoritative view here, I'll check in with him and update this next" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301313 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [12:45:30] (03CR) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [12:48:23] (03CR) 10CWilliams: [C:04-1] sre.mysql: split pool/depool (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1295480 (https://phabricator.wikimedia.org/T422361) (owner: 10Federico Ceratto) [12:49:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:49:20] (03CR) 10CWilliams: [C:04-1] sre.mysql: split pool/depool (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1295480 (https://phabricator.wikimedia.org/T422361) (owner: 10Federico Ceratto) [12:55:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:56:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad row A/B switch upgrade - https://phabricator.wikimedia.org/T418012#12014223 (10cmooney) @VRiley-WMF when 
you have the new switches racked can you set the locations in Netbox and rename them? Names should be "lsw1--eqiad" f... [12:57:55] (03PS5) 10Phuedx: Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) (owner: 10Clare Ming) [12:59:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:00:20] (03PS1) 10Hnowlan: metamonitoring: add downtime support [puppet] - 10https://gerrit.wikimedia.org/r/1301356 (https://phabricator.wikimedia.org/T429020) [13:04:08] (03PS1) 10Atsuko: toolhub: switch prod to prod opensearch cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301358 (https://phabricator.wikimedia.org/T426073) [13:04:21] (03PS1) 10Cathal Mooney: Nokia: enable DHCP relay and IPv6 RAs on all IRB sub-ints [homer/public] - 10https://gerrit.wikimedia.org/r/1301359 (https://phabricator.wikimedia.org/T428908) [13:04:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:05:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:10:11] (03CR) 10Bking: [C:03+1] dse-k8s: bump opensearch-semantic-search mem quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300945 (https://phabricator.wikimedia.org/T426589) (owner: 10Ryan Kemper) [13:10:38] (03PS2) 10Cathal Mooney: Nokia: enable DHCP relay and IPv6 RAs on all IRB sub-ints [homer/public] - 10https://gerrit.wikimedia.org/r/1301359 (https://phabricator.wikimedia.org/T428908) 
[13:11:21] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300770 (owner: 10L10n-bot) [13:11:30] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:12:19] (03PS1) 10Btullis: Fix the wdqs namespace in the dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301360 (https://phabricator.wikimedia.org/T424338) [13:15:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:16:23] (03CR) 10Dzahn: "ACK, I see the profile::zuul::launcher::user_token in the private repo. Where can I find the new value to update it? 
Want to put it into a" [puppet] - 10https://gerrit.wikimedia.org/r/1300922 (https://phabricator.wikimedia.org/T424061) (owner: 10Dduvall) [13:16:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:21:01] (03PS3) 10Cathal Mooney: Nokia: enable DHCP relay and IPv6 RAs on all IRB sub-ints [homer/public] - 10https://gerrit.wikimedia.org/r/1301359 (https://phabricator.wikimedia.org/T428908) [13:24:25] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [13:24:29] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [13:25:18] (03CR) 10Trueg: [C:03+1] Fix the wdqs namespace in the dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301360 (https://phabricator.wikimedia.org/T424338) (owner: 10Btullis) [13:26:45] (03PS17) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [13:34:03] (03CR) 10Marostegui: [C:03+1] "This can be merged anytime, it is a noop until we deploy the grants to all the DBs" [puppet] - 10https://gerrit.wikimedia.org/r/1301324 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [13:34:40] (03PS1) 10Muehlenhoff: Add component/zookeeper34 [puppet] - 10https://gerrit.wikimedia.org/r/1301364 (https://phabricator.wikimedia.org/T428495) [13:34:47] (03CR) 10Alex Paskulin: [C:03+1] "Looks good! Thank you!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) (owner: 10Clément Goubert) [13:36:39] (03CR) 10Federico Ceratto: "That's not related to this CR (for T419874) but in future we can discuss if we want to use this cookbook as part of the switchover process" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [13:40:19] (03CR) 10Bking: [C:03+1] Fix the wdqs namespace in the dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301360 (https://phabricator.wikimedia.org/T424338) (owner: 10Btullis) [13:40:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:41:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:44:04] (03CR) 10CWilliams: cookbooks/sre/mysql/decommission: add cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [13:44:49] (03CR) 10Brouberol: [C:03+1] Fix the wdqs namespace in the dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301360 (https://phabricator.wikimedia.org/T424338) (owner: 10Btullis) [13:45:46] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:45:52] (03CR) 10Marostegui: "What do you mean? The issue you saw isn't related to this cookbook?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [13:47:18] (03PS2) 10Bking: WIP: cirrussearch: Flesh out deployment-prep plan [puppet] - 10https://gerrit.wikimedia.org/r/1300927 (https://phabricator.wikimedia.org/T425585) [13:47:38] (03CR) 10CI reject: [V:04-1] WIP: cirrussearch: Flesh out deployment-prep plan [puppet] - 10https://gerrit.wikimedia.org/r/1300927 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [13:47:53] (03PS3) 10Bking: cirrussearch: Flesh out deployment-prep plan [puppet] - 10https://gerrit.wikimedia.org/r/1300927 (https://phabricator.wikimedia.org/T425585) [13:48:25] (03PS4) 10Bking: cirrussearch: Flesh out deployment-prep plan [puppet] - 10https://gerrit.wikimedia.org/r/1300927 (https://phabricator.wikimedia.org/T425585) [13:48:30] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:48:33] (03CR) 10CI reject: [V:04-1] cirrussearch: Flesh out deployment-prep plan [puppet] - 10https://gerrit.wikimedia.org/r/1300927 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [13:50:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:53:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:56:49] (03CR) 10Btullis: [C:03+2] Fix the wdqs namespace in the dse-k8s clusters [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1301360 (https://phabricator.wikimedia.org/T424338) (owner: 10Btullis) [13:59:24] (03PS1) 10Muehlenhoff: Blocklisting more unused packet mangling/network scheduler modules [puppet] - 10https://gerrit.wikimedia.org/r/1301368 [13:59:38] (03CR) 10Dzahn: [V:03+1 C:03+2] "tested query" [puppet] - 10https://gerrit.wikimedia.org/r/1299434 (https://phabricator.wikimedia.org/T428290) (owner: 10Aklapper) [13:59:57] (03PS2) 10Muehlenhoff: Blocklisting more unused packet mangling/network scheduler modules [puppet] - 10https://gerrit.wikimedia.org/r/1301368 [14:00:33] (03CR) 10Bking: [C:03+1] "conditional +1 on addressing my comment about merging/clobbering." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301358 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [14:05:15] (03CR) 10Atsuko: toolhub: switch prod to prod opensearch cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301358 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [14:05:24] (03Merged) 10jenkins-bot: Fix the wdqs namespace in the dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301360 (https://phabricator.wikimedia.org/T424338) (owner: 10Btullis) [14:05:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:08:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:08:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:11:21] (03CR) 10Cathal Mooney: [C:03+1] "TY! That will definitely help :)" [alerts] - 10https://gerrit.wikimedia.org/r/1301241 (https://phabricator.wikimedia.org/T424794) (owner: 10Tiziano Fogli) [14:12:07] (03CR) 10Dzahn: [C:03+1] gitlab: advertise gitlab-ssh url on gitlab replicas [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [14:12:57] (03CR) 10Dzahn: [C:03+1] gitlab: support extra ssh host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/1298771 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [14:13:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:13:58] (03PS1) 10Lucas Werkmeister (WMDE): Hotfix for T428620 [extensions/Wikibase] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301371 (https://phabricator.wikimedia.org/T428620) [14:14:30] (03CR) 10Marostegui: [C:03+1] "I have no context on this, so from my side, the syntax is ok :)" [puppet] - 10https://gerrit.wikimedia.org/r/1298835 (https://phabricator.wikimedia.org/T409857) (owner: 10FNegri) [14:14:45] volans, Raine: I would like to do an emergency deploy; per https://wikitech.wikimedia.org/wiki/Deployments/Emergencies, are there any ongoing issues that would prevent it? [14:15:16] Lucas_WMDE: no issues [14:15:17] dduvall, jnuche as train conductors: I would like to emergency deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1301371 if that’s okay [14:15:29] (I already have someone to deploy, myself) [14:15:30] Raine: thanks [14:16:27] Lucas_WMDE: ok from RelEng [14:17:03] alright, just checking if I’ll be able to confirm the fix on WikimediaDebug (i.e. 
reproduce it) [14:17:19] ok from me given the content of the patch, though I'm curious what the impact is [14:17:40] Raine: AFAIK the user impact is low but many ops-y people are complaining about the logspam [14:18:03] ok, fair enough, go for it [14:18:08] alright, I can reproduce the error on mwdebug [14:18:10] deploying [14:18:12] thanks! [14:18:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301371 (https://phabricator.wikimedia.org/T428620) (owner: 10Lucas Werkmeister (WMDE)) [14:18:38] (that still leaves some 5-10 minutes for anyone to shout stop at me before it starts rolling out ^^) [14:19:48] no objections [14:20:51] (03CR) 10Federico Ceratto: "This cookbook (according to its task) is meant to set sections in RO mode and change the MariaDB RO variable on the primary master." [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [14:25:34] (03CR) 10CI reject: [V:04-1] Hotfix for T428620 [extensions/Wikibase] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301371 (https://phabricator.wikimedia.org/T428620) (owner: 10Lucas Werkmeister (WMDE)) [14:26:04] hm, main test build failed ^ [14:26:09] let’s hope the gate-and-submit works anyway? [14:26:22] otherwise tbh I’d be inclined to force-merge, this is an emergency fix [14:26:43] (error at https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83/83993/console is somewhere in CentralAuth, seems hardly conceivable that it could be related to the patch at all) [14:28:35] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:29:24] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:29:38] (03PS4) 10Dzahn: contint: add second proxy for jenkins on an external host [puppet] - 10https://gerrit.wikimedia.org/r/1300916 (https://phabricator.wikimedia.org/T418521) [14:30:03] (03Merged) 10jenkins-bot: Hotfix for T428620 [extensions/Wikibase] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301371 (https://phabricator.wikimedia.org/T428620) (owner: 10Lucas Werkmeister (WMDE)) [14:30:08] yay [14:30:16] wait what [14:30:23] scap/spiderpig failed anyway 😠 [14:30:32] * Lucas_WMDE retries [14:31:52] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1301371|Hotfix for T428620 (T428620)]] [14:31:58] T428620: TypeError: Wikibase UsageDeduplicator::deduplicateStatementUsages(): Argument must be of type array (Warning: Undefined array key "C") - https://phabricator.wikimedia.org/T428620 [14:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:35:29] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1301371|Hotfix for T428620 (T428620)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:35:33] testing… [14:36:14] purging https://commons.wikimedia.org/wiki/File:Die_St._Peter_und_Pauls-Gemeinde_in_Mankato,_Minnesota,_von_ihren_Anfangen_bis_auf_die_Gegenwart_-_DPLA_-_a5f753b0a678b2901243a33e9dc4b46d_(page_262).jpg on mwdebug seems to produce no logstash messages about the error, yay [14:36:33] (03PS2) 10Hnowlan: metamonitoring: add downtime support [puppet] - 10https://gerrit.wikimedia.org/r/1301356 (https://phabricator.wikimedia.org/T429020) [14:36:37] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with deployment [14:38:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:43:10] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1301371|Hotfix for T428620 (T428620)]] (duration: 11m 17s) [14:43:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:43:16] T428620: TypeError: Wikibase UsageDeduplicator::deduplicateStatementUsages(): Argument must be of type array (Warning: Undefined array key "C") - https://phabricator.wikimedia.org/T428620 [14:43:21] logstash is looking *much* better, yay [14:43:39] * Lucas_WMDE done deploying unless something unexpected pops up [14:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:48:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:50:37] yay [15:02:18] (03CR) 10Ahmon Dancy: "Thanks Marostegui!" [puppet] - 10https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [15:21:43] (03CR) 10Bking: [C:03+2] cirrussearch: Flesh out deployment-prep plan [puppet] - 10https://gerrit.wikimedia.org/r/1300927 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [15:24:02] (03PS2) 10Atsuko: translate: production opensearch on k8s endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301373 (https://phabricator.wikimedia.org/T425377) [15:24:19] (03PS18) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [15:26:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301373 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [15:27:58] (03CR) 10DCausse: [C:03+1] translate: production opensearch on k8s endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301373 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [15:28:58] (03PS1) 10Jforrester: CacheTesterResultsJob: Re-hydrate stashedResult to stdClass [extensions/WikiLambda] (wmf/1.47.0-wmf.6) - 
10https://gerrit.wikimedia.org/r/1301389 (https://phabricator.wikimedia.org/T428954) [15:36:38] (03PS3) 10Daniel Kinzler: rest-gateway: Dockerize system tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297666 (https://phabricator.wikimedia.org/T424825) [15:36:43] (03CR) 10Cathal Mooney: [C:03+1] "LGTM thanks! Fwiw we need to update out deploy scripts for Homer for trixie so right now the tool isn't properly installed on it. We'll " [puppet] - 10https://gerrit.wikimedia.org/r/1301330 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [15:37:42] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1301309 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [15:38:19] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1301328 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [15:39:16] (03CR) 10Cathal Mooney: [C:03+1] "tyvm!" [alerts] - 10https://gerrit.wikimedia.org/r/1301236 (https://phabricator.wikimedia.org/T424794) (owner: 10Tiziano Fogli) [15:47:11] (03PS19) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [15:47:30] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [15:53:45] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12014764 (10MoritzMuehlenhoff) [15:56:02] (03CR) 10Ssingh: "Once this is ready, please let me know so I can review and also we will need to run VTC on both text and upload." 
[puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [15:56:42] (03PS1) 10Bking: cirrussearch: hard-code bind mount [puppet] - 10https://gerrit.wikimedia.org/r/1301396 (https://phabricator.wikimedia.org/T425585) [15:58:57] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [15:59:52] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:00:30] (03PS1) 10Hnowlan: sre: Add sre.metamonitoring.downtime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1301397 (https://phabricator.wikimedia.org/T429020) [16:03:57] (03CR) 10Hnowlan: sre: Add sre.metamonitoring.downtime cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1301397 (https://phabricator.wikimedia.org/T429020) (owner: 10Hnowlan) [16:05:31] (03CR) 10Bking: [C:03+1] toolhub: switch prod to prod opensearch cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301358 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [16:07:33] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:28] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:10:37] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:15:22] (03PS4) 10Daniel Kinzler: rest-gateway: Dockerize system tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297666 (https://phabricator.wikimedia.org/T424825) [16:16:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:09] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:26:24] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:28:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:34:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:39] (03PS1) 10Dzahn: releases: mask tmp.mount [puppet] - 10https://gerrit.wikimedia.org/r/1301400 (https://phabricator.wikimedia.org/T418299) [16:36:42] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:38:22] (03PS1) 10Arlolra: Deploy PRV to 5 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301401 (https://phabricator.wikimedia.org/T429038) [16:38:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:41:05] (03CR) 10Dzahn: "we should use systemd::mask whereever we try to mask things" [puppet] - 10https://gerrit.wikimedia.org/r/1301400 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [16:45:44] (03CR) 10Dzahn: [V:03+1] 
"https://puppet-compiler.wmflabs.org/output/1300916/8734/contint1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1300916 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:48:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:53:05] (03CR) 10Bking: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1301396 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [16:53:49] FIRING: [2x] HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:54:04] 06SRE, 10Wikimedia-Mailing-lists: "You are doing that too often. Please try again later." during subscription a mailing list. - https://phabricator.wikimedia.org/T220914#12014934 (10valerio.bozzolan) 05Declined→03Open Steps to reproduce: - be anonymous - subscribe here: https://lists.wikimedia.org/postori... [16:59:57] (03CR) 10Volans: "Nice addition! 
I noticed a couple of minor things and left few nits ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1301397 (https://phabricator.wikimedia.org/T429020) (owner: 10Hnowlan) [17:08:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:13:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:14:13] (03CR) 10CDobbins: varnish: Add CSP report-only header value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [17:20:09] (03CR) 10Bking: [V:03+2 C:03+2] "Jenkins looks like it's stuck and this is not touching prod, so I'm going to go ahead and self-merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/1301396 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [17:23:31] (03PS1) 10Bking: cirrussearch: Fix bind mount path for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1301408 (https://phabricator.wikimedia.org/T425585) [17:33:33] (03PS5) 10Daniel Kinzler: rest-gateway: Dockerize system tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297666 (https://phabricator.wikimedia.org/T424825) [17:39:38] (03CR) 10BCornwall: "I think it would be less maintenance, less cognitive overload, and less human time spent if we just put these in the VCL and were done wit" [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [17:50:58] (03CR) 10Bking: [V:03+2 C:03+2] "Similar to my previous change, Jenkins is stuck and this does not touch prod. Self-merging..." [puppet] - 10https://gerrit.wikimedia.org/r/1301408 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [17:59:19] (03PS1) 10Dwisehaupt: Adjust fundraisingdb-read for host update/reboot [dns] - 10https://gerrit.wikimedia.org/r/1301415 [17:59:34] (03PS1) 10Majavah: hieradata: Make jenkins_enabled match reality [puppet] - 10https://gerrit.wikimedia.org/r/1301416 [18:00:26] (03CR) 10Majavah: [V:03+2 C:03+2] "force-merging since this is to unbreak CI" [puppet] - 10https://gerrit.wikimedia.org/r/1301416 (owner: 10Majavah) [18:05:11] (03CR) 10Komla Sapaty: "Gentle reminder on this. Happy to make any additional changes if needed." [puppet] - 10https://gerrit.wikimedia.org/r/1294864 (https://phabricator.wikimedia.org/T423549) (owner: 10Komla Sapaty) [18:31:40] (03CR) 10Dwisehaupt: [C:03+2] "@astein@wikimedia.org gave a verbal +2. pushing this." 
[dns] - 10https://gerrit.wikimedia.org/r/1301415 (owner: 10Dwisehaupt) [18:32:08] !log dwisehaupt@dns1006 START - running authdns-update [18:34:04] !log dwisehaupt@dns1006 END - running authdns-update [18:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [18:45:39] (03PS1) 10Dwisehaupt: point fundraisingdb-read back at frdb1008 [dns] - 10https://gerrit.wikimedia.org/r/1301418 [19:05:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:06:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:07:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:07:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:15:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet are marked down 
but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:15:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:17:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:17:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:20:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:20:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:21:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:22:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:25:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:25:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:27:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:30:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:31:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:31:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:39:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:39:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:41:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:42:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:52:11] (03CR) 10Dwisehaupt: "Verbal +2 from @astein@wikimedia.org." 
[dns] - 10https://gerrit.wikimedia.org/r/1301418 (owner: 10Dwisehaupt) [19:52:18] (03CR) 10Dwisehaupt: [C:03+2] point fundraisingdb-read back at frdb1008 [dns] - 10https://gerrit.wikimedia.org/r/1301418 (owner: 10Dwisehaupt) [19:52:54] !log dwisehaupt@dns1004 START - running authdns-update [19:54:36] !log dwisehaupt@dns1004 END - running authdns-update [20:13:36] (03PS1) 10Fabfur: cache::haproxy: remove x-provenance feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1301429 (https://phabricator.wikimedia.org/T427068) [20:16:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:31] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1301429 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [20:35:50] (03PS1) 10Fabfur: cache::haproxy: req.provenance to txn.provenance [puppet] - 10https://gerrit.wikimedia.org/r/1301431 (https://phabricator.wikimedia.org/T427068) [20:35:53] (03PS1) 10Fabfur: cache::haproxy: log txn.provenance variable for haproxykafka [puppet] - 10https://gerrit.wikimedia.org/r/1301432 (https://phabricator.wikimedia.org/T427068) [20:40:52] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1301431 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [20:40:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1301432 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [20:54:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:23:39] (03PS1) 10Ahmon Dancy: beta: Add deployment-db15 to db-labs config at weight 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301445 (https://phabricator.wikimedia.org/T428930) [21:27:44] (03CR) 10Scott French: "Oh, that's very interesting to know!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301313 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [21:29:14] (03CR) 10Ahmon Dancy: [C:03+2] beta: Add deployment-db15 to db-labs config at weight 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301445 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [21:30:12] (03Merged) 10jenkins-bot: beta: Add deployment-db15 to db-labs config at weight 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301445 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [21:32:10] (03CR) 10Scott French: [C:03+1] "Thanks, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1301364 (https://phabricator.wikimedia.org/T428495) (owner: 10Muehlenhoff) [21:53:17] (03CR) 10Dzahn: "updated zuul launcher token in private repo" [puppet] - 10https://gerrit.wikimedia.org/r/1300922 (https://phabricator.wikimedia.org/T424061) (owner: 10Dduvall) [21:53:33] (03CR) 10Dzahn: [C:03+2] zuul: Update certificate_authority_data for new cluster [puppet] - 10https://gerrit.wikimedia.org/r/1300922 (https://phabricator.wikimedia.org/T424061) (owner: 10Dduvall) [22:02:35] (03Abandoned) 10Ahmon Dancy: beta::autoupdater: Remove more obsolete stuff after scap prep auto [puppet] - 10https://gerrit.wikimedia.org/r/753787 (owner: 10Ahmon Dancy) [22:04:51] (03PS2) 10Arlolra: Deploy PRV to 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301401 (https://phabricator.wikimedia.org/T429038) [22:17:51] uhhhhh did gerrit just go down for everyone else too? "upstream connect error or disconnect/reset before headers. 
reset reason: connection timeout" [22:17:53] (03CR) 10Jforrester: "> James, do you happen to know whether that's limited to a fixed set of hosts (e.g., just wikidata and commons)?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301313 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [22:19:36] FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:20:07] bearloga: gerrit receiving a lot of scraping action at the moment, SRE team is looking [22:20:42] ugh/oof! good luck to the SRE team! [22:24:31] RESOLVED: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:24:48] (03PS18) 10Ahmon Dancy: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) (owner: 10BryanDavis) [22:30:26] (03CR) 10Ahmon Dancy: "bd808: I added a one hour timeout since I noticed hangs during /usr/local/bin/wmf-beta-update-databases.py recently." 
[puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) (owner: 10BryanDavis) [22:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [22:51:58] (03PS1) 10Arlolra: Store nowiki source in StripState::extra to support subst-nowiki [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301451 (https://phabricator.wikimedia.org/T398967) [22:56:50] bearloga: gerrit should be acting better now - [23:02:42] (03CR) 10CI reject: [V:04-1] Store nowiki source in StripState::extra to support subst-nowiki [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301451 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [23:02:47] (03CR) 10Thcipriani: "Noticed this today looking for apache logs. Is there a log rotate set up here?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1300049 (https://phabricator.wikimedia.org/T425667) (owner: 10Jelto) [23:21:52] (03CR) 10Arlolra: "recheck" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301451 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [23:42:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1301455 [23:42:54] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1301455 (owner: 10TrainBranchBot) [23:55:02] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1301455 (owner: 10TrainBranchBot)