[00:04:21] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1012.eqiad.wmnet with OS bookworm [00:04:29] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10212901 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm execu... [00:10:48] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1078765 (owner: TrainBranchBot) [00:15:32] (CR) ZhaoFJx: "Thanks for letting me know!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) (owner: ZhaoFJx) [00:19:55] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212902 (phaultfinder) [00:43:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:59:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212920 (phaultfinder) [01:14:25] FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[01:16:13] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:25:15] (CR) Hamish: Configure ContactPage and IPBE contact form on zhwiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998) (owner: Hamish) [01:29:18] (CR) Hamish: [C:+1] zhwiki: Allow event-organizer self remove usergroup [mediawiki-config] - https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) (owner: ZhaoFJx) [01:36:15] FIRING: [3x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:41:15] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:49:59] (PS1) Hamish: zhwiki: Revise contact page field usage [mediawiki-config] - https://gerrit.wikimedia.org/r/1078773 [01:59:43] (PS1) Albertoleoncio: [brwikimedia] Enable the CampaignEvents extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) [02:04:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212935 (phaultfinder) [02:08:12] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) 
(owner: Albertoleoncio) [02:14:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:24:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:36:13] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:40] RESOLVED: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:41:10] FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:44:52] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212956 (phaultfinder) [02:51:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:52:00] FIRING: [2x] CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:01:13] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:09:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212964 (phaultfinder) [03:10:56] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212965 (Papaul) [03:37:59] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:09:33] FIRING: KubernetesCalicoDown: kubernetes2056.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2056.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:11:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[04:11:44] Deployment cfssl-issuer in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=cert-manager&var-deployment=cfssl-issuer - ... [04:11:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [04:14:33] FIRING: [2x] KubernetesCalicoDown: kubernetes2056.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:16:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [04:19:33] FIRING: [2x] KubernetesCalicoDown: wikikube-worker2041.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:19:47] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212990 (phaultfinder) [04:29:33] RESOLVED: KubernetesCalicoDown: wikikube-worker2059.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2059.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:33:33] FIRING: KubernetesCalicoDown: mw2447.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2447.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:38:33] RESOLVED: [3x] KubernetesCalicoDown: mw2337.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:41:44] FIRING: [3x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [04:43:33] FIRING: KubernetesCalicoDown: wikikube-worker2074.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2074.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:48:33] FIRING: [5x] KubernetesCalicoDown: mw2437.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:53:33] FIRING: [9x] KubernetesCalicoDown: mw2310.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [04:58:33] FIRING: [13x] KubernetesCalicoDown: kubernetes2038.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:01:36] FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [05:03:33] FIRING: [16x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:04:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:05:04] <_joe_> I am at the gym [05:05:32] <_joe_> I can’t get home in less than 30 minutes [05:06:36] RESOLVED: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [05:08:33] FIRING: [16x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:09:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:11:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [05:12:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:12:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:12:31] _joe_: just came in to shut down my computer for the night and now reading through back-scroll ... 
I'll start looking at the KubernetesCalicoDown alerts as a starting point, as I assume they're the source of this [05:12:41] moving to -sre with lower noise [05:13:33] FIRING: [20x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:14:47] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10213049 (phaultfinder) [05:17:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:17:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:17:37] FIRING: ProbeDown: Service miscweb2003:30443 has failed probes (http_static_codereview_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:17:45] ops-eqiad, SRE, DC-Ops, SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10213053 (VRiley-WMF) Open→Resolved Thanks! I'll resolve this for now. Feel free to reopen if the issue crops up again. 
[05:18:33] FIRING: [23x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:19:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:20:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:20:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:22:37] RESOLVED: ProbeDown: Service miscweb2003:30443 has failed probes (http_static_codereview_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:23:33] FIRING: [37x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:23:33] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [05:24:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - 
kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:25:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:25:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:28:33] FIRING: [40x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:28:33] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [05:28:37] FIRING: ProbeDown: Service miscweb2003:30443 has failed probes (http_static_codereview_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:28:48] FIRING: [40x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:29:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:30:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:33:33] FIRING: [53x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:33:37] RESOLVED: [4x] ProbeDown: Service miscweb2003:30443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:33:48] FIRING: [53x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:34:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:34:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:35:56] FIRING: WcqsStreamingUpdaterFlinkJobNotRunning: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [05:36:02] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:38:33] FIRING: [69x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:38:52] FIRING: [6x] ProbeDown: Service miscweb2003:30443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:39:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:39:15] RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:39:53] ο/ [05:40:02] FIRING: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [05:40:51] FIRING: [5x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:40:56] RESOLVED: WcqsStreamingUpdaterFlinkJobNotRunning: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [05:41:02] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:41:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:41:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [05:42:31] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:42:43] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [05:43:33] FIRING: [85x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:43:48] FIRING: [85x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:43:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=restbase.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [05:43:57] RESOLVED: [6x] ProbeDown: Service miscweb2003:30443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:44:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:44:27] FIRING: [2x] ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:45:51] FIRING: [9x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:45:57] FIRING: [2x] ProbeDown: Service miscweb2003:30443 has failed probes (http_research_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:46:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [05:46:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:46:15] FIRING: [4x] ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:48:04] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [05:48:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 3.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:48:33] FIRING: [114x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:48:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:48:48] FIRING: [114x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:48:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [05:49:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=kartotherian.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [05:49:56] FIRING: [2x] 
RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [05:50:38] !incidents [05:50:38] 5300 (ACKED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [05:50:38] 5302 (ACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [05:50:39] 5303 (ACKED) ProbeDown sre (ip4 probes/service codfw) [05:50:39] 5304 (UNACKED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [05:50:39] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [05:50:39] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [05:50:50] !ack 5304 [05:50:50] 5304 (ACKED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [05:50:51] FIRING: [9x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:50:57] FIRING: [8x] ProbeDown: Service miscweb2003:30443 has failed probes (http_design_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:51:11] I have no idea why kartotherian is complaining, but higher priority stuff exists right now [05:51:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:51:15] FIRING: [8x] ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:51:55] FIRING: [3x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [05:53:20] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:53:26] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:53:33] FIRING: [132x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 7.831% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:54:45] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [05:55:39] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [05:55:39] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - ... [05:55:45] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqcodfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqiad.cirrussearch.update_pipeline.update.rc0&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesI [05:56:00] FIRING: [7x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:56:05] FIRING: [12x] ProbeDown: Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:56:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:56:19] FIRING: [7x] ProbeDown: Service cxserver:4002 has failed probes (http_cxserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:56:33] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - 
https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [05:56:36] FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [05:56:55] RESOLVED: [3x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [05:57:59] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:58:33] FIRING: [147x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:59:10] (PS1) Jelto: admin: bump calico resources in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1078791 [05:59:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 24.46% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:59:27] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:59:34] (CR) Alexandros Kosiaris: [C:+1] admin: bump calico resources in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1078791 (owner: Jelto) [05:59:45] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [06:00:00] (CR) Effie Mouzeli: [C:+1] admin: bump calico resources in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1078791 (owner: Jelto) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T0600) [06:00:49] (CR) Jelto: [V:+2] admin: bump calico resources in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1078791 (owner: Jelto) [06:00:51] FIRING: [8x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:00:52] (CR) Jelto: [V:+2 C:+2] admin: bump calico resources in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1078791 (owner: Jelto) [06:00:57] RESOLVED: [11x] ProbeDown: Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:01:15] FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate 
[06:01:15] RESOLVED: [6x] ProbeDown: Service cxserver:4002 has failed probes (http_cxserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:01:33] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [06:01:36] RESOLVED: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [06:01:55] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [06:03:01] !incidents [06:03:01] 5300 (ACKED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [06:03:02] 5302 (ACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [06:03:02] 5304 (ACKED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [06:03:02] 5306 (UNACKED) [2x] ProbeDown sre (ip4 probes/service codfw) [06:03:02] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [06:03:03] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [06:03:03] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [06:03:03] 
5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [06:03:07] !ack 5306 [06:03:07] 5306 (ACKED) [2x] ProbeDown sre (ip4 probes/service codfw) [06:03:10] !incidents [06:03:10] 5300 (ACKED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [06:03:10] 5302 (ACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [06:03:10] 5304 (ACKED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [06:03:11] 5306 (ACKED) [2x] ProbeDown sre (ip4 probes/service codfw) [06:03:11] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [06:03:11] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [06:03:11] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [06:03:12] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [06:03:50] FIRING: [2x] ProbeDown: Service miscweb2003:30443 has failed probes (http_transparency_archive_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:04:17] FIRING: [2x] ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:04:21] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:04:26] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 1.974% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:04:30] FIRING: [145x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:04:36] FIRING: [145x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:04:45] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:04:57] FIRING: WdqsStreamingUpdaterFlinkJobNotRunning: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [06:05:02] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:05:06] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 16.51% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:05:17] FIRING: [2x] 
ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:05:33] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [06:05:45] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [06:05:51] FIRING: [10x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:06:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [06:06:12] RESOLVED: [12x] ProbeDown: Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:06:13] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[06:06:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:06:19] FIRING: [16x] ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:06:45] FIRING: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [06:07:27] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [06:07:57] RESOLVED: [3x] ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:07:59] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:08:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 24.3% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:08:33] FIRING: [157x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node 
Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:08:36] !incidents [06:08:37] 5300 (ACKED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [06:08:37] 5302 (ACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [06:08:37] 5304 (ACKED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [06:08:38] 5306 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service codfw) [06:08:38] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [06:08:38] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [06:08:38] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [06:08:39] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [06:08:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:08:52] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[06:08:56] RESOLVED: WdqsStreamingUpdaterFlinkJobNotRunning: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:25] RESOLVED: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:09:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:09:55] FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:10:02] !incidents [06:10:02] 5300 (ACKED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [06:10:03] 5302 (ACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [06:10:03] 5304 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [06:10:03] 5306 (RESOLVED) [2x] ProbeDown sre (ip4 
probes/service codfw) [06:10:03] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [06:10:03] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [06:10:04] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [06:10:04] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [06:10:33] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [06:10:41] RESOLVED: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [06:10:45] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - ... [06:10:51] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqcodfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqiad.cirrussearch.update_pipeline.update.rc0&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesI [06:10:58] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [06:11:12] FIRING: [9x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:11:19] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:11:23] RESOLVED: [14x] ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:11:45] RESOLVED: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [06:13:00] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:13:09] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [06:13:14] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [06:13:33] RESOLVED: [102x] KubernetesCalicoDown: kubernetes2021.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:13:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:13:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:14:23] <_joe_> !incidents [06:14:24] 5300 (ACKED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [06:14:24] 5302 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [06:14:24] 5304 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [06:14:24] 5306 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service codfw) [06:14:25] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [06:14:25] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [06:14:25] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [06:14:26] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [06:14:49] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10213090 (phaultfinder) [06:14:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [06:17:00] RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [06:25:51] 
RESOLVED: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:36:10] ops-eqiad, SRE, DC-Ops: Degraded RAID on ms-be1065 - https://phabricator.wikimedia.org/T376775 (ops-monitoring-bot) NEW [06:36:56] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:51:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:52:00] FIRING: [2x] CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:54:55] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:04] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:10] (CR) Slyngshede: [C:+2] ldapbackend: Remove post_save signal for user models. 
[software/bitu] - https://gerrit.wikimedia.org/r/1078343 (https://phabricator.wikimedia.org/T346601) (owner: Slyngshede) [07:01:13] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:02:41] (Merged) jenkins-bot: ldapbackend: Remove post_save signal for user models. [software/bitu] - https://gerrit.wikimedia.org/r/1078343 (https://phabricator.wikimedia.org/T346601) (owner: Slyngshede) [07:06:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:06:56] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:07:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [07:13:40] !log remove ganeti2010 from active nodes T376594 [07:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:43] T376594: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594 [07:16:15] FIRING: ProbeDown: Service ganeti2010:1811 has failed probes (tcp_ganeti_noded_ip4) - 
https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:19:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:20:34] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:20:56] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:22:27] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:22:51] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:26:28] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [07:26:50] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [07:27:00] (03PS1) 10Giuseppe Lavagetto: [stub] mwscript-k8s: Add concurrency limiting via poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/1078796 [07:27:54] (03CR) 10Jon Harald Søby: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: 10Pppery) [07:28:54] (03CR) 10CI reject: [V:04-1] [stub] mwscript-k8s: Add concurrency limiting via poolcounter [puppet] - 
10https://gerrit.wikimedia.org/r/1078796 (owner: 10Giuseppe Lavagetto) [07:36:58] (03PS1) 10Slyngshede: P:idm add passlib dependency. [puppet] - 10https://gerrit.wikimedia.org/r/1078875 [07:38:40] (03CR) 10Slyngshede: [C:03+2] P:idm add passlib dependency. [puppet] - 10https://gerrit.wikimedia.org/r/1078875 (owner: 10Slyngshede) [07:43:24] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [07:43:45] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [07:45:09] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [07:46:10] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:47:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1011.eqiad.wmnet [07:48:31] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [07:49:23] (03CR) 10Elukey: "test-cookbooked, works fine.. 
Of course I found other corner cases of BIOS settings for newer Supermicro models, sigh, but unrelated to th" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [07:51:13] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:51:15] RESOLVED: ProbeDown: Service ganeti2010:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:52:08] (03PS1) 10Muehlenhoff: Switch cloudcephosd1011 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078891 (https://phabricator.wikimedia.org/T349619) [07:53:21] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1011 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078891 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:55:50] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10213162 (10elukey) Found a new interesting issue when running the provision cookbook for mc-misc2001: ` "Message":... 
[07:57:25] (03CR) 10Elukey: "Folks I am battling with Supermicro and BIOS/UEFI in https://phabricator.wikimedia.org/T365372#10213162, there are some weird things that " [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [07:58:30] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376764#10213168 (10phaultfinder) [08:00:05] andre and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T0800) [08:00:23] o/ [08:00:52] (03PS2) 10Brouberol: airflow: expose non-sensitive configuration in the web UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078689 [08:02:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1011.eqiad.wmnet [08:02:57] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1021.eqiad.wmnet [08:03:52] I will now start promoting group1 wikis to 1.43.0-wmf.26 [08:04:11] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078893 (https://phabricator.wikimedia.org/T375657) [08:04:12] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078893 (https://phabricator.wikimedia.org/T375657) (owner: 10TrainBranchBot) [08:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:05:02] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078893 (https://phabricator.wikimedia.org/T375657) (owner: 10TrainBranchBot) [08:07:21] (03PS1) 10Muehlenhoff: Switch 
cloudcephosd1021 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078894 (https://phabricator.wikimedia.org/T349619) [08:08:14] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1021 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078894 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:10:52] (03PS1) 10Slyngshede: P:trafficserver::backend remove CloudIDM. [puppet] - 10https://gerrit.wikimedia.org/r/1078895 [08:11:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1021.eqiad.wmnet [08:12:00] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.26 refs T375657 [08:12:03] T375657: 1.43.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T375657 [08:12:50] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jonathan Tweed - https://phabricator.wikimedia.org/T376777 (10JTweed-WMF) 03NEW [08:13:30] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4253/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078895 (owner: 10Slyngshede) [08:16:02] (03PS1) 10Brouberol: data-platform: alert if datahub/superset pods are down for at least 5 minutes [alerts] - 10https://gerrit.wikimedia.org/r/1078897 [08:16:29] (03CR) 10JMeybohm: "I don't think this is required. NodePort traffic will bypass the INPUT chain and is not captured by ferm." [puppet] - 10https://gerrit.wikimedia.org/r/1078719 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [08:18:36] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephmon1005.eqiad.wmnet [08:18:39] (03PS2) 10Slyngshede: P:trafficserver::backend remove CloudIDM. 
[puppet] - 10https://gerrit.wikimedia.org/r/1078895 [08:19:31] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4254/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078895 (owner: 10Slyngshede) [08:19:57] (03PS1) 10Muehlenhoff: Switch cloudcephmon1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078898 (https://phabricator.wikimedia.org/T349619) [08:21:14] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephmon1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078898 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:22:14] (03CR) 10Volans: [stub] mwscript-k8s: Add concurrency limiting via poolcounter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1078796 (owner: 10Giuseppe Lavagetto) [08:22:46] (03CR) 10Btullis: [C:03+1] Provision dummy secrets for ceph-csi users [labs/private] - 10https://gerrit.wikimedia.org/r/1078401 (https://phabricator.wikimedia.org/T376407) (owner: 10Brouberol) [08:23:36] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host cloudcephmon1005.eqiad.wmnet [08:23:42] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1078402 (https://phabricator.wikimedia.org/T376407) (owner: 10Brouberol) [08:24:18] (03CR) 10Brouberol: [C:03+2] Provision dummy secrets for ceph-csi users [labs/private] - 10https://gerrit.wikimedia.org/r/1078401 (https://phabricator.wikimedia.org/T376407) (owner: 10Brouberol) [08:24:20] (03CR) 10Brouberol: [V:03+2 C:03+2] Provision dummy secrets for ceph-csi users [labs/private] - 10https://gerrit.wikimedia.org/r/1078401 (https://phabricator.wikimedia.org/T376407) (owner: 10Brouberol) [08:24:28] (03CR) 10Brouberol: [V:03+1 C:03+2] ceph: provision the dse-k8s-csi-cephfs user capabilities [puppet] - 10https://gerrit.wikimedia.org/r/1078402 (https://phabricator.wikimedia.org/T376407) (owner: 10Brouberol) [08:29:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:30:46] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [08:30:56] (03CR) 10Ayounsi: ripeatlas: clean up resource defs after deletion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [08:33:13] (03PS2) 10Elukey: sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) [08:35:53] (03CR) 10Brouberol: [C:03+2] data-platform: alert if datahub/superset pods are down for at least 5 minutes [alerts] - 10https://gerrit.wikimedia.org/r/1078897 (owner: 10Brouberol) [08:36:14] (03CR) 10Stevemunene: [C:03+2] Change an-worker117[67] to use reuse partman recipe. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077913 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [08:36:44] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:37:06] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:37:53] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:38:14] (03PS3) 10Brouberol: airflow: align default configuration with our pupetized instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078689 [08:41:18] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:43:21] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078689 (owner: 10Brouberol) [08:43:52] (03CR) 10Volans: [C:03+1] "The approach LGTM, to be tested if the host behaves as we espect." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [08:44:27] (03CR) 10Brouberol: [C:03+2] airflow: align default configuration with our pupetized instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078689 (owner: 10Brouberol) [08:46:09] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:46:30] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:47:44] (03PS8) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) [08:48:24] (03CR) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [08:48:46] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye [08:49:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:49:22] (03PS1) 10Lucas Werkmeister (WMDE): scap: Also exclude (my)sql from mwscript deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553) [08:49:50] (03CR) 10Btullis: [C:03+1] airflow: align default configuration with our pupetized instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078689 (owner: 10Brouberol) [08:49:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:50:50] (03PS3) 10Elukey: sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) 
[08:51:13] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [08:51:14] (03CR) 10CI reject: [V:04-1] scap: Also exclude (my)sql from mwscript deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553) (owner: 10Lucas Werkmeister (WMDE)) [08:52:00] (03CR) 10Vgutierrez: [C:03+1] "[nitpick] it looks like you missed the Bug footer on the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/1078895 (owner: 10Slyngshede) [08:52:08] (03CR) 10Elukey: "I have isolated the code to reboot for readability, and also I made sure that the reboot happens after the BMC network settings are applie" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [08:52:51] (03PS4) 10Elukey: sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) [08:53:18] (03PS2) 10Lucas Werkmeister (WMDE): scap: Also exclude (my)sql from mwscript deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553) [08:53:18] (03CR) 10Lucas Werkmeister (WMDE): scap: Also exclude (my)sql from mwscript deprecation warning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553) (owner: 10Lucas Werkmeister (WMDE)) [08:53:38] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:54:01] (03PS3) 10Slyngshede: P:trafficserver::backend remove CloudIDM. 
[puppet] - 10https://gerrit.wikimedia.org/r/1078895 (https://phabricator.wikimedia.org/T360795) [08:54:02] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:55:17] (03CR) 10Slyngshede: [C:03+2] P:trafficserver::backend remove CloudIDM. [puppet] - 10https://gerrit.wikimedia.org/r/1078895 (https://phabricator.wikimedia.org/T360795) (owner: 10Slyngshede) [08:56:17] (03PS1) 10Kosta Harlan: dumps: Drop the globalblocks table dump [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) [08:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:58:01] (03CR) 10Kosta Harlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [09:04:31] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:14:32] (03CR) 10Elukey: [C:03+2] "Tested and it seems working nicely!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:18:45] (03CR) 10Ladsgroup: dumps: Drop the globalblocks table dump (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [09:20:42] (03PS1) 10Mhorsey: Release CampaignEvents to eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078907 (https://phabricator.wikimedia.org/T376786) [09:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:27:12] (03Merged) 10jenkins-bot: sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:32:17] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti2009/ganeti2010 from Ganeti role [puppet] - 10https://gerrit.wikimedia.org/r/1078660 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [09:37:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10213457 (10MoritzMuehlenhoff) [09:40:05] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1026.eqiad.wmnet [09:41:03] (03PS1) 10Muehlenhoff: Switch cloudcephosd1026 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078909 (https://phabricator.wikimedia.org/T349619) [09:42:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10213469 (10MoritzMuehlenhoff) Rollout of 
further nodes is put on hold until we get Redfish licences for the new servers. [09:42:50] !log Started time limited MediaModertation scan on enwiki for 16hrs to catchup with monthly request limit - https://wikitech.wikimedia.org/wiki/MediaModeration [09:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1000) [10:05:26] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [10:05:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:07:59] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:09:48] (03PS9) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) [10:09:57] (03PS10) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) [10:10:14] (03PS11) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) [10:10:43] (03CR) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 
10Arturo Borrero Gonzalez) [10:10:44] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [10:11:09] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye [10:12:10] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1026 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078909 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:12:55] (03PS1) 10Kosta Harlan: dumps: stop running the dump_global_blocks job [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) [10:15:23] (03PS2) 10Kosta Harlan: dumps: Drop the globalblocks table dump [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) [10:15:39] (03PS2) 10Kosta Harlan: dumps: Stop running the dump_global_blocks job [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) [10:15:46] (03PS3) 10Kosta Harlan: dumps: Drop the globalblocks table dump [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) [10:16:20] (03CR) 10Kosta Harlan: dumps: Drop the globalblocks table dump (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [10:16:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1026.eqiad.wmnet [10:17:16] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [10:19:50] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10213566 (10phaultfinder) [10:19:54] FIRING: KubernetesAPILatency: 
High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:21:48] (03CR) 10Elukey: [C:03+2] swift: avoid rate-limit for the Docker account [puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [10:26:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P69506 and previous config saved to /var/cache/conftool/dbconfig/20241009-102636-ladsgroup.json [10:27:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1027.eqiad.wmnet [10:28:19] !log roll restart swift-proxy on ms-fe* to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1078380 [10:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:23] (03PS1) 10Muehlenhoff: Switch cloudcephosd1027 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078916 (https://phabricator.wikimedia.org/T349619) [10:30:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10213583 (10elukey) Hi folks! Not sure what is special about the server, but in this case the Supermicro Network settings for the BMC can't be applied... [10:32:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10213586 (10jcrespo) @elukey thanks, that is all I needed, some info on where we were and I wasn't aware of ongoing progress during my sabbatical, whic... 
[10:34:31] 06SRE, 06Infrastructure-Foundations: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790 (10MoritzMuehlenhoff) 03NEW [10:34:32] 06SRE, 06Infrastructure-Foundations: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790#10213600 (10MoritzMuehlenhoff) p:05Triage→03High [10:35:40] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: sync [10:36:24] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1027 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078916 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:37:02] (03CR) 10Btullis: [C:03+2] Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [10:38:01] btullis: you can merge my patch along [10:41:02] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1077710 (owner: 10Klausman) [10:41:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P69507 and previous config saved to /var/cache/conftool/dbconfig/20241009-104142-ladsgroup.json [10:44:15] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [10:44:39] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye [10:46:02] (03PS2) 10Btullis: Switch cephosd1001 to use the nftables based firewall [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) [10:48:00] (03CR) 10Muehlenhoff: "You still need to define the setting for Envoy" [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [10:48:09] FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on 
k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:49:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1027.eqiad.wmnet [10:50:50] this is me --^ [10:51:41] (03PS1) 10Elukey: services: update proxied port for kask [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078918 (https://phabricator.wikimedia.org/T363996) [10:52:00] FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:56:00] (03CR) 10Elukey: [C:03+2] "Testing this since it is staging and already broken :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078918 (https://phabricator.wikimedia.org/T363996) (owner: 10Elukey) [10:56:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P69511 and previous config saved to /var/cache/conftool/dbconfig/20241009-105647-ladsgroup.json [11:00:05] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1100). 
[11:00:37] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: sync [11:00:39] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [11:03:38] (03PS2) 10Giuseppe Lavagetto: python_deploy::venv: transform into a define [puppet] - 10https://gerrit.wikimedia.org/r/1078707 [11:03:39] (03PS2) 10Giuseppe Lavagetto: fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782) [11:03:39] (03PS2) 10Giuseppe Lavagetto: profile::conftool: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078709 (https://phabricator.wikimedia.org/T371782) [11:04:17] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: sync [11:04:31] (03CR) 10CI reject: [V:04-1] fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto) [11:07:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [11:08:09] RESOLVED: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:11:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P69513 and previous config saved to /var/cache/conftool/dbconfig/20241009-111154-ladsgroup.json [11:12:29] 
06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Timeout while retrieving the catalog from the Docker Registry - https://phabricator.wikimedia.org/T376285#10213709 (10elukey) Rolled out the swift proxy change, it seems that it has a solved the issue. Doing more tests before closing to... [11:14:12] (03PS3) 10Giuseppe Lavagetto: fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782) [11:14:12] (03PS3) 10Giuseppe Lavagetto: profile::conftool: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078709 (https://phabricator.wikimedia.org/T371782) [11:15:36] (03CR) 10Ladsgroup: "This absents the timer but not the directories. That's fine if you want to keep old dumps. Whatever you decide." [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [11:24:24] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [11:27:18] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4255/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078707 (owner: 10Giuseppe Lavagetto) [11:29:35] (03CR) 10Volans: [C:03+1] "LGTM, surely it makes sense to convert it to a define so that we can use more than one per host. Thanks for the patch." [puppet] - 10https://gerrit.wikimedia.org/r/1078707 (owner: 10Giuseppe Lavagetto) [11:30:13] (03CR) 10Kosta Harlan: "I think that is probably OK? How soon after merging this one can we merge I428c67b7ed7b481bfbe084b0a6e3f1025f9e6d9d ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [11:37:09] FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:38:21] (03PS1) 10Arturo Borrero Gonzalez: team-wmcs: add kernel panic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) [11:40:02] (03PS1) 10Gmodena: dse-k8s-services: content_history: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787) [11:47:40] (03PS1) 10Brouberol: cephosd: fix syntax of the dse-k8s-csi-cephfs caps [puppet] - 10https://gerrit.wikimedia.org/r/1078926 (https://phabricator.wikimedia.org/T376407) [11:48:25] (03CR) 10Btullis: [C:03+1] "Great, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1078926 (https://phabricator.wikimedia.org/T376407) (owner: 10Brouberol) [11:48:34] (03PS2) 10Brouberol: cephosd: fix syntax of the dse-k8s-csi-cephfs caps [puppet] - 10https://gerrit.wikimedia.org/r/1078926 (https://phabricator.wikimedia.org/T376407) [11:49:55] FIRING: SystemdUnitFailed: wmf_auto_restart_routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:51:10] (03CR) 10Brouberol: [C:03+2] cephosd: fix syntax of the dse-k8s-csi-cephfs caps [puppet] - 10https://gerrit.wikimedia.org/r/1078926 (https://phabricator.wikimedia.org/T376407) (owner: 10Brouberol) [11:51:29] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: remove experimental endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078664 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [11:51:59] !log kcvelaga@deploy2002 Started deploy [airflow-dags/analytics_product@b2c30ad]: T375153 [11:52:01] !log start systemctl start wmf_auto_restart_routinator.service on rpki2003 [11:52:02] T375153: ETL pipeline for Automoderator daily monitoring metrics - https://phabricator.wikimedia.org/T375153 [11:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:30] !log kcvelaga@deploy2002 Finished deploy [airflow-dags/analytics_product@b2c30ad]: T375153 (duration: 02m 32s) [11:54:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:55:23] (03Merged) 10jenkins-bot: wikidata-query-gui: remove experimental endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078664 (https://phabricator.wikimedia.org/T350793) (owner: 
10Jelto) [12:02:58] (03PS2) 10JMeybohm: [stub] mwscript-k8s: Add concurrency limiting via poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/1078796 (https://phabricator.wikimedia.org/T376795) (owner: 10Giuseppe Lavagetto) [12:04:52] (03CR) 10CI reject: [V:04-1] [stub] mwscript-k8s: Add concurrency limiting via poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/1078796 (https://phabricator.wikimedia.org/T376795) (owner: 10Giuseppe Lavagetto) [12:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:07:25] (03CR) 10Brouberol: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [12:11:16] (03CR) 10Awight: [C:03+1] [config] Rename moved gadget name setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078463 (https://phabricator.wikimedia.org/T362771) (owner: 10WMDE-Fisch) [12:11:56] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:15:01] !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:15:15] !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:15:46] !log jelto@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:16:04] !log jelto@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[12:18:59] !log installing initramfs-tools bugfix updates from Bookworm point release [12:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:07] (03CR) 10Milimetric: [C:03+1] "Just to say I approve of this change. In our attempts to find users of various dumps, we haven't heard anyone speak up for this one." [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [12:20:13] (03PS1) 10Gmodena: dse-k8s-services: content_history: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787) [12:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:21:39] (03PS1) 10JMeybohm: Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795) [12:21:41] (03PS1) 10JMeybohm: Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795) [12:23:26] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:23:37] (03PS2) 10Gmodena: dse-k8s-services: content_history: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787) [12:23:43] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:24:02] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:24:14] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[12:24:36] (03CR) 10Gmodena: dse-k8s-services: content_history: version bump image. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [12:26:59] (03CR) 10Giuseppe Lavagetto: [C:03+1] Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm) [12:27:24] (03CR) 10Giuseppe Lavagetto: [C:03+1] Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm) [12:33:16] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts rpki2002.codfw.wmnet [12:33:57] (03PS1) 10Kosta Harlan: QuickSurveys: Deploy Safety Survey with zero coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078929 (https://phabricator.wikimedia.org/T376517) [12:34:33] (03CR) 10Tiziano Fogli: ripeatlas: clean up resource defs after deletion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [12:35:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078929 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [12:36:09] (03CR) 10Daimona Eaytoy: [C:03+1] "Noting that this will configure the extension to use the shared database in x1, so:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: 10Albertoleoncio) [12:38:07] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:38:12] (03PS2) 10JMeybohm: Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795) [12:38:12] (03PS2) 10JMeybohm: Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795) [12:38:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078463 (https://phabricator.wikimedia.org/T362771) (owner: 10WMDE-Fisch) [12:38:59] (03Abandoned) 10CDanis: ferm: allow DNS traffic against k8s control planes [puppet] - 10https://gerrit.wikimedia.org/r/1078719 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [12:41:13] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:41:16] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rpki2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [12:41:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rpki2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [12:41:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:41:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rpki2002.codfw.wmnet [12:43:43] (03PS3) 10Tiziano Fogli: ripeatlas: clean up resource defs after deletion [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) [12:45:10] (03CR) 
10Tiziano Fogli: [C:03+2] ripeatlas: clean up resource defs after deletion [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [12:45:34] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10213939 (10Aklapper) (Per T376798 I removed an image from this task.) [12:47:28] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:51:38] (03CR) 10Alexandros Kosiaris: [C:03+1] Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm) [12:52:35] (03CR) 10Alexandros Kosiaris: [C:03+1] Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm) [12:53:05] (03CR) 10Klausman: [C:03+2] hiera/modules: Add ML Lab machine roles and config [puppet] - 10https://gerrit.wikimedia.org/r/1077710 (owner: 10Klausman) [12:54:42] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10213960 (10ayounsi) Phase 2 lgtm, one point though : you need to trunk the management vlan between the old and new switch for fasw to be reachable between step... [12:57:17] (03CR) 10Btullis: [C:03+2] Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [12:58:57] (03CR) 10Albertoleoncio: "Yep, that's right." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: 10Albertoleoncio) [12:59:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1300). [13:00:05] Ammar, albertoleoncio, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10213969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm [13:00:09] hi [13:00:13] hi [13:00:31] o/ [13:00:32] do you all mind if I go first, as I need to be away from the keyboard in ~25 minutes? 
[13:00:42] sure, go ahead imho [13:00:46] sure [13:00:49] thx [13:01:09] starting then [13:01:11] (I’ll also be in a meeting in half an hour from now btw, so let’s see if we get through all the changes in time) [13:01:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078929 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [13:02:08] (03CR) 10Daimona Eaytoy: [C:03+1] "Great, thanks for confirming ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: 10Albertoleoncio) [13:02:14] ack, will move it along as quickly as I can [13:02:35] (03Merged) 10jenkins-bot: QuickSurveys: Deploy Safety Survey with zero coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078929 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [13:03:38] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1078929|QuickSurveys: Deploy Safety Survey with zero coverage (T376517)]] [13:03:40] T376517: First test, then launch the new Safety Survey - https://phabricator.wikimedia.org/T376517 [13:05:22] (03PS1) 10Klausman: aptrepo: Add two missing packages to rocm61 repo [puppet] - 10https://gerrit.wikimedia.org/r/1078937 (https://phabricator.wikimedia.org/T376380) [13:05:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:06:07] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1078929|QuickSurveys: Deploy Safety Survey with zero coverage (T376517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:07:57] which option should I use with the WikimediaDebug 
browser extension? k8s-mwdebug? [13:08:19] kostajh: yep :) [13:08:26] thx [13:08:39] (03CR) 10Jelto: [C:03+1] "I think the timer jobs are present on the replica machines, see https://puppet-compiler.wmflabs.org/output/1078752/4250/gerrit2003.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/1078752 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [13:09:01] the other ones should also work at the moment (`scap backport` deploys to all of them) but k8s-mwdebug is the one with a future ;) [13:09:36] !log kharlan@deploy2002 kharlan: Continuing with sync [13:09:40] lgtm [13:09:47] (03CR) 10FNegri: team-wmcs: add kernel panic alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [13:11:08] Ammar: just checking, are you around? (once kostajh is done deploying) [13:11:46] Lucas_WMDE yes [13:11:56] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:02] ok :) [13:12:07] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host etherpad2002.codfw.wmnet [13:12:13] (03CR) 10Ilias Sarantopoulos: [C:03+1] aptrepo: Add two missing packages to rocm61 repo [puppet] - 10https://gerrit.wikimedia.org/r/1078937 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [13:13:51] (03CR) 10Klausman: [C:03+2] aptrepo: Add two missing packages to rocm61 repo [puppet] - 10https://gerrit.wikimedia.org/r/1078937 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [13:14:15] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078929|QuickSurveys: Deploy Safety Survey with zero coverage (T376517)]] (duration: 10m 37s) [13:14:18] T376517: First test, then launch the new Safety Survey - 
https://phabricator.wikimedia.org/T376517 [13:14:23] ok, over to you Lucas_WMDE [13:14:24] thanks! [13:14:28] ok! [13:14:30] thank you! [13:14:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078680 (https://phabricator.wikimedia.org/T376536) (owner: 10Ammarpad) [13:14:52] (03PS2) 10Slyngshede: Speed holes. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675 [13:15:15] Hello 0lly! We're getting some puppet errors on `datahubsearch` (opensearch errors). Based on T362429 , it seems this could be related to your team releasing a new curator pkg...can anyone take a look? [13:15:15] T362429: Investigate Puppet failures on datahubsearch hosts - https://phabricator.wikimedia.org/T362429 [13:15:25] (03Merged) 10jenkins-bot: sdwiki: Add new logo and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078680 (https://phabricator.wikimedia.org/T376536) (owner: 10Ammarpad) [13:15:32] hmm [13:15:36] oops, wrong room [13:15:38] so my change worked with mw debug [13:15:52] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1078680|sdwiki: Add new logo and tagline (T376536)]] [13:15:55] T376536: Request for change the sd.wikipedia logo - https://phabricator.wikimedia.org/T376536 [13:15:56] but now I get `Error: Module "ext.quicksurveys.lib" is not loaded` when I try to load the survey :/ [13:16:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad2002.codfw.wmnet [13:16:39] (03PS3) 10Slyngshede: Permissions: Cleanup code and reduce LDAP queries. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675 [13:16:56] ah, now it works 🤷 [13:17:40] huh, ok [13:17:41] (03PS4) 10Slyngshede: Permissions: Cleanup code and reduce LDAP queries. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675 [13:17:55] I guess you need to be on a purged page or something? 
[13:17:56] not sure [13:18:07] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host etherpad1004.eqiad.wmnet [13:18:11] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, ammarpad: Backport for [[gerrit:1078680|sdwiki: Add new logo and tagline (T376536)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:18:30] Ammar: please test! [13:18:49] OK [13:19:00] I definitely see a difference, but I can’t say if it’s right or not ^^ [13:20:22] (03CR) 10Btullis: "I'll have a go at this." [puppet] - 10https://gerrit.wikimedia.org/r/1076910 (owner: 10Hashar) [13:22:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad1004.eqiad.wmnet [13:22:38] I tried with purging but really can't get the new logo [13:22:49] did you try ctrl+f5? [13:22:55] that’s what I had to do [13:23:08] (and with WikimediaDebug enabled, of course [13:23:08] This is the new logo https://commons.wikimedia.org/wiki/File:Wikipedia-logo-v2-sd.svg [13:23:45] yes, that matches what I see at https://sd.wikipedia.org/wiki/%D9%85%D9%8F%DA%A9_%D8%B5%D9%81%D8%AD%D9%88?useskin=vector [13:23:50] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org [13:25:14] (03CR) 10JMeybohm: [C:03+2] Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm) [13:25:17] (03CR) 10JMeybohm: [C:03+2] Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm) [13:25:31] (03CR) 10Hashar: "Done by I393e27133e0ad7bb414491e76fa959368c14be86" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [13:26:05] (03PS1) 10Btullis: Allow underscores, hyphens, and dots in hdfs_file names 
[puppet] - 10https://gerrit.wikimedia.org/r/1078940 (https://phabricator.wikimedia.org/T323692) [13:26:31] Ammar: did you try force-reloading the page? (https://en.wikipedia.org/wiki/Help:Purge#Purge_local_browser_cache has some more keyboard shortcuts) [13:26:43] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4256/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [13:27:08] @Lucas_WMDE Ok yes it works now. (yes I am using WikimediaDebug) [13:27:12] ok! [13:27:13] :] [13:27:15] and does it look correct? [13:27:41] (I’m not sure if that’s implied by “it works now” so I just want to make sure ^^) [13:27:44] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org [13:28:13] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm1001.wikimedia.org [13:28:22] (03CR) 10Elukey: [C:03+2] "Already fixed by Riccardo!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [13:29:24] (03Merged) 10jenkins-bot: Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm) [13:29:25] (03Merged) 10jenkins-bot: Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm) [13:30:35] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm1001.wikimedia.org [13:30:45] I guess I’ll continue with the deployment… [13:30:49] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, ammarpad: Continuing with sync [13:30:54] Lucas_WMDE sorry. I am seeing the new logo correctly. 
It works [13:30:58] ok, great! [13:31:00] You can proceed [13:31:03] thanks! [13:31:05] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org [13:31:49] (03CR) 10Xcollazo: [C:03+1] dumps: Stop running the dump_global_blocks job [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [13:31:50] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm, let me know when this should be deployed" [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [13:32:09] (03PS1) 10Klausman: aptrepo: Add more missing packages to the rocm61 import [puppet] - 10https://gerrit.wikimedia.org/r/1078941 (https://phabricator.wikimedia.org/T376380) [13:32:09] RESOLVED: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:32:27] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gerrit2003.wikimedia.org [13:32:35] (03Abandoned) 10Hashar: Add some HIDPI Wikivoyage logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529464 (https://phabricator.wikimedia.org/T230114) (owner: 10Jc86035) [13:32:37] (03CR) 10Klausman: [C:03+2] aptrepo: Add more missing packages to the rocm61 import [puppet] - 10https://gerrit.wikimedia.org/r/1078941 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [13:33:10] 06SRE, 06Data-Engineering, 06Data-Platform, 10Dumps-Generation, and 4 others: Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10214033 (10xcollazo) As per [[ https://wikimedia.slack.com/archives/CTFK3B423/p1728413707760419 | slack discussion ]], noting... 
[13:33:28] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org [13:34:18] Lucas_WMDE Thank you [13:34:39] np :) [13:35:26] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078680|sdwiki: Add new logo and tagline (T376536)]] (duration: 19m 34s) [13:35:29] T376536: Request for change the sd.wikipedia logo - https://phabricator.wikimedia.org/T376536 [13:35:54] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10214086 (10MoritzMuehlenhoff) [13:36:06] is anyone else around who can deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1078774 for albertoleoncio? [13:36:13] I’m in a meeting now so I’d prefer not to deploy in parallel [13:36:16] please... =D [13:37:05] (03Abandoned) 10Hashar: Update HD logo for wikisource using default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 (owner: 10Viztor) [13:37:17] oh, damn, and I need to purge for Ammar [13:37:17] one sec [13:39:05] !log lucaswerkmeister-wmde@deploy2002 $ printf 'https://en.wikipedia.org/static/images/%s\n' 'project-logos/sdwiki.png' 'project-logos/sdwiki-1.5x.png' 'project-logos/sdwiki-2x.png' 'mobile/copyright/wikipedia-wordmark-sd.svg' 'mobile/copyright/wikipedia-tagline-sd.svg' | mwscript-k8s --attach -- purgeList.php # T376536 [13:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:22] took a bit longer than usual but done [13:39:29] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum [13:39:34] (03CR) 10Jelto: [V:03+1 C:03+2] jenkins: dedupe apt::repository for thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [13:39:43] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp-test1004.wikimedia.org [13:39:48] (03Merged) 10jenkins-bot: 
tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [13:40:06] (03PS1) 10Muehlenhoff: Fix /etc/issue for config-master [puppet] - 10https://gerrit.wikimedia.org/r/1078942 [13:40:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: 10Albertoleoncio) [13:40:30] alright, I’ll do the deploy for albertoleoncio [13:40:30] (03PS2) 10Muehlenhoff: Fix /etc/issue for config-master [puppet] - 10https://gerrit.wikimedia.org/r/1078942 [13:40:37] might just be a bit slower than usual ^^ [13:40:44] but should be doable in the remaining 20 minutes [13:40:45] jouncebot: next [13:40:46] In 0 hour(s) and 19 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1400) [13:41:08] (03Merged) 10jenkins-bot: [brwikimedia] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: 10Albertoleoncio) [13:41:24] !log brouberol@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. 
[13:41:33] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1078774|[brwikimedia] Enable the CampaignEvents extension (T376747)]] [13:41:36] T376747: Enable CampaignEvents Extension on br.wikimedia - https://phabricator.wikimedia.org/T376747 [13:41:59] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1077417 would also be nice to deploy but probably won’t happen this window [13:42:06] (03CR) 10Xcollazo: "Looks like the following also needs to be removed:" [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [13:42:14] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1028.eqiad.wmnet [13:42:43] !log brouberol@cumin1002 END (ERROR) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=97) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. [13:43:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T367856)', diff saved to https://phabricator.wikimedia.org/P69516 and previous config saved to /var/cache/conftool/dbconfig/20241009-134305-ladsgroup.json [13:43:07] (03PS1) 10Elukey: kask: use if instead of with in _config.yaml to skip tls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) [13:43:09] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:43:11] (03CR) 10Bking: [C:03+1] Allow underscores, hyphens, and dots in hdfs_file names [puppet] - 10https://gerrit.wikimedia.org/r/1078940 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [13:43:40] (03CR) 10Btullis: [C:03+2] Allow underscores, hyphens, and dots in hdfs_file names [puppet] - 10https://gerrit.wikimedia.org/r/1078940 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [13:43:42] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM 
idp-test1004.wikimedia.org [13:43:45] (03PS1) 10Klausman: aptrepo: Add more missing packages to the rocm61 import, pt2 [puppet] - 10https://gerrit.wikimedia.org/r/1078943 (https://phabricator.wikimedia.org/T37638) [13:43:52] !log lucaswerkmeister-wmde@deploy2002 albertoleoncio, lucaswerkmeister-wmde: Backport for [[gerrit:1078774|[brwikimedia] Enable the CampaignEvents extension (T376747)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:44:03] (03CR) 10Klausman: [C:03+2] aptrepo: Add more missing packages to the rocm61 import, pt2 [puppet] - 10https://gerrit.wikimedia.org/r/1078943 (https://phabricator.wikimedia.org/T37638) (owner: 10Klausman) [13:44:03] Looks good here [13:44:03] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp1004.wikimedia.org [13:44:03] albertoleoncio: can you test using WikimediaDebug? [13:44:06] ok! [13:44:08] !log lucaswerkmeister-wmde@deploy2002 albertoleoncio, lucaswerkmeister-wmde: Continuing with sync [13:44:35] (03PS2) 10Elukey: kask: use if instead of with in _config.yaml to skip tls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) [13:44:52] (03PS1) 10Muehlenhoff: Switch cloudcephosd1028 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078945 (https://phabricator.wikimedia.org/T349619) [13:44:54] yup, https://br.wikimedia.org/wiki/Especial:AllEvents definitely shows a nonzero amount of events [13:44:57] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk1001.eqiad.wmnet [13:45:00] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host flink-zk1001.eqiad.wmnet [13:45:21] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk1001.eqiad.wmnet [13:45:37] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1028 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078945 (https://phabricator.wikimedia.org/T349619) 
(owner: 10Muehlenhoff) [13:46:00] klausman: I'll merge your patch along [13:46:04] ty! [13:47:05] merged [13:48:00] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp1004.wikimedia.org [13:48:26] (03PS1) 10Slyngshede: IDP: Failover IDP service to eqiad. [dns] - 10https://gerrit.wikimedia.org/r/1078946 [13:48:37] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078774|[brwikimedia] Enable the CampaignEvents extension (T376747)]] (duration: 07m 04s) [13:48:40] T376747: Enable CampaignEvents Extension on br.wikimedia - https://phabricator.wikimedia.org/T376747 [13:48:43] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp-test2004.wikimedia.org [13:48:55] !log UTC afternoon backport+config window done [13:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:06] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1001.eqiad.wmnet [13:49:16] jouncebot: nowandnext [13:49:16] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1300) [13:49:16] In 0 hour(s) and 10 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1400) [13:49:31] Lucas_WMDE: Thanks! 
=D [13:49:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1028.eqiad.wmnet [13:49:39] np ^^ [13:50:11] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on backup[1010-1011].eqiad.wmnet with reason: T376800 [13:50:25] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on backup[1010-1011].eqiad.wmnet with reason: T376800 [13:50:44] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk1003.eqiad.wmnet [13:50:57] (03CR) 10Elukey: [C:04-1] "Nope this is not the correct approach, it should already be working." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) (owner: 10Elukey) [13:51:11] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test2004.wikimedia.org [13:51:24] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp-test2005.wikimedia.org [13:51:57] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. 
[13:52:12] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet [13:52:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [13:52:56] FIRING: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:52:59] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox [13:52:59] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1004.wikimedia.org [13:53:00] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:53:19] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[13:54:21] (03PS2) 10Arturo Borrero Gonzalez: team-wmcs: add kernel panic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) [13:54:30] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1003.eqiad.wmnet [13:55:19] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test2005.wikimedia.org [13:56:07] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zookeeper-test1002.eqiad.wmnet [13:57:21] (03PS3) 10Elukey: services: skip the kask's tls config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) [13:57:39] (03PS1) 10Muehlenhoff: Add members of platform-engineering to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1078948 (https://phabricator.wikimedia.org/T376808) [13:57:44] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:57:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [13:57:47] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk1002.eqiad.wmnet [13:58:02] RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in eqiad (k8s) is unstable - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:58:09] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Phase out platform-engineering POSIX group - https://phabricator.wikimedia.org/T376808 (10MoritzMuehlenhoff) 03NEW [13:58:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P69517 and previous config saved to /var/cache/conftool/dbconfig/20241009-135812-ladsgroup.json [13:58:33] (03CR) 10Slyngshede: [C:03+2] IDP: Failover IDP service to eqiad. [dns] - 10https://gerrit.wikimedia.org/r/1078946 (owner: 10Slyngshede) [13:58:45] (03CR) 10Ladsgroup: "In paper, half an hour but we usually wait longer (for days) to make sure all hosts that were shut down or had puppet disabled get the upd" [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1400) [14:00:25] (03CR) 10Hnowlan: [C:03+2] api-gateway: add REST gateway Lua CSP handler (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [14:00:28] (03CR) 10Kosta Harlan: "this patch is all we need, yes." [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [14:01:04] * James_F waves. 
[14:01:31] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1002.eqiad.wmnet [14:01:38] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:02:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet [14:02:38] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:03:02] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-09-24-145528 to 2024-10-08-175830 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078949 (https://phabricator.wikimedia.org/T347086) [14:03:08] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-09-24-221243 to 2024-10-08-175510 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078950 (https://phabricator.wikimedia.org/T347086) [14:03:10] (03PS3) 10Kosta Harlan: dumps: Stop running the dump_global_blocks job [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) [14:03:13] (03PS1) 10Jforrester: wikifunctions: Enable Wikidata dereferencing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078951 (https://phabricator.wikimedia.org/T370072) [14:03:14] (03CR) 10Ladsgroup: [C:03+2] dumps: Stop running the dump_global_blocks job [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [14:03:16] (03CR) 10Ladsgroup: [V:03+2 C:03+2] dumps: Stop running the dump_global_blocks job [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [14:03:30] (03PS1) 10Klausman: aptrepo: Add more missing packages to the rocm61 import, pt3 [puppet] - 10https://gerrit.wikimedia.org/r/1078952 (https://phabricator.wikimedia.org/T37638) [14:03:55] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp2004.wikimedia.org [14:04:20] FIRING: 
CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [14:04:57] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [14:05:31] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-09-24-145528 to 2024-10-08-175830 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078949 (https://phabricator.wikimedia.org/T347086) (owner: 10Jforrester) [14:05:49] (03CR) 10Klausman: [C:03+2] aptrepo: Add more missing packages to the rocm61 import, pt3 [puppet] - 10https://gerrit.wikimedia.org/r/1078952 (https://phabricator.wikimedia.org/T37638) (owner: 10Klausman) [14:05:58] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:06:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet [14:06:20] (03PS3) 10Arturo Borrero Gonzalez: team-wmcs: add kernel panic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) [14:06:25] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-09-24-145528 to 2024-10-08-175830 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078949 (https://phabricator.wikimedia.org/T347086) (owner: 10Jforrester) [14:06:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10214438 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS 
bookworm [14:06:40] (03CR) 10Arturo Borrero Gonzalez: team-wmcs: add kernel panic alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [14:06:47] FIRING: HelmReleaseBadStatus: Helm release kube-system/calico on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:06:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:06:58] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1004.wikimedia.org [14:07:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10214450 (10elukey) This is the current error: ` Applying Network changes to the BMC. Error while configuring BIOS or mgmt interface: PATCH https://10... 
[14:07:54] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp2004.wikimedia.org [14:07:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet [14:07:59] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:08:51] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:08:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002" [14:08:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002" [14:08:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:09:17] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1078948 (https://phabricator.wikimedia.org/T376808) (owner: 10Muehlenhoff) [14:09:19] (03CR) 10FNegri: [C:03+1] "LGTM, we can refine further when we have some real-world examples." 
[alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [14:09:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet [14:09:26] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:09:45] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:10:01] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:11:03] !log installing Apache security updates [14:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:08] (03CR) 10Elukey: [C:03+1] "TIL profile::base::production::role_description" [puppet] - 10https://gerrit.wikimedia.org/r/1078942 (owner: 10Muehlenhoff) [14:11:22] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:11:27] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:11:38] 06SRE, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Open a new WMA-core mailing list - https://phabricator.wikimedia.org/T37638#10214465 (10Ladsgroup) @klausman Hi, Are you sure the ticket your connecting your patches to is the correct one? First patch was fine, but this is the second patch. [14:11:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2033.codfw.wmnet [14:12:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet [14:12:23] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:12:37] 06SRE, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Open a new WMA-core mailing list - https://phabricator.wikimedia.org/T37638#10214477 (10klausman) >>! 
In T37638#10214464, @Ladsgroup wrote: > @klausman Hi, Are you sure the ticket your connecting your patches to is the correct one? First patch was fin... [14:13:08] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-09-24-221243 to 2024-10-08-175510 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078950 (https://phabricator.wikimedia.org/T347086) (owner: 10Jforrester) [14:13:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P69519 and previous config saved to /var/cache/conftool/dbconfig/20241009-141319-ladsgroup.json [14:13:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2033.codfw.wmnet [14:14:35] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-09-24-221243 to 2024-10-08-175510 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078950 (https://phabricator.wikimedia.org/T347086) (owner: 10Jforrester) [14:14:42] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] team-wmcs: add kernel panic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [14:14:53] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:16:06] (03CR) 10Muehlenhoff: [C:03+2] Fix /etc/issue for config-master [puppet] - 10https://gerrit.wikimedia.org/r/1078942 (owner: 10Muehlenhoff) [14:16:50] (03PS1) 10Ssingh: P:ntp: increase check and retry interval [puppet] - 10https://gerrit.wikimedia.org/r/1078953 [14:17:10] (03PS2) 10Ssingh: P:ntp: increase check_interval [puppet] - 10https://gerrit.wikimedia.org/r/1078953 [14:17:28] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:18:18] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4258/console" [puppet] - 10https://gerrit.wikimedia.org/r/1078953 
(owner: 10Ssingh) [14:18:23] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:18:29] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and A:wikidough [14:18:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002" [14:18:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002" [14:18:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:18:50] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:19:21] (03CR) 10Hnowlan: [C:03+1] services: skip the kask's tls config in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) (owner: 10Elukey) [14:19:52] (03CR) 10Ssingh: [V:03+1 C:03+2] P:ntp: increase check_interval [puppet] - 10https://gerrit.wikimedia.org/r/1078953 (owner: 10Ssingh) [14:19:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [14:20:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [14:20:09] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:20:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm [14:20:25] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - 
https://phabricator.wikimedia.org/T370678#10214504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm executed w... [14:20:25] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:20:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [14:20:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [14:21:05] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb2004-dev [14:21:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudlb2004-dev [14:21:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2033.codfw.wmnet [14:21:22] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:21:45] (03PS4) 10Elukey: services: skip the kask's tls config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) [14:21:47] !log sudo cumin 'O:alerting_host' 'run-puppet-agent' [14:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:49] (03PS1) 10Arturo Borrero Gonzalez: wmcs: declare prometheus::node_kernel_panic in profile::base::cloud_production [puppet] - 10https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) [14:21:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:21:58] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org [14:22:04] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1012.eqiad.wmnet with reason: host reimage [14:22:56] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb2004-dev [14:23:01] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [14:23:04] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudlb2004-dev [14:23:40] !log failover master for ganeti/routed to ganeti2033 [14:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:44] (03CR) 10CI reject: [V:04-1] wmcs: declare prometheus::node_kernel_panic in profile::base::cloud_production [puppet] - 10https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [14:24:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling for reclone (T375652)', diff saved to https://phabricator.wikimedia.org/P69520 and previous config saved to /var/cache/conftool/dbconfig/20241009-142404-ladsgroup.json [14:24:07] T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry - https://phabricator.wikimedia.org/T375652 [14:24:50] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10214546 (10phaultfinder) [14:24:58] (03CR) 10Scott French: [C:03+1] "Yeah, I believe this is the right way to go, especially for non-staging where 
we need to leave certs.cassandra intact (which is why I was " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) (owner: 10Elukey) [14:25:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10214548 (10elukey) I checked the firmware version of the BMC and I got: `'Oem': {'Supermicro': {'UniqueFilename': 'BMC_X12AST2600-ROT-5201MS_20221105_... [14:25:22] (03PS1) 10JMeybohm: Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078957 (https://phabricator.wikimedia.org/T376795) [14:25:39] (03CR) 10Elukey: [C:03+2] services: skip the kask's tls config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) (owner: 10Elukey) [14:27:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1012.eqiad.wmnet with reason: host reimage [14:28:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T367856)', diff saved to https://phabricator.wikimedia.org/P69521 and previous config saved to /var/cache/conftool/dbconfig/20241009-142826-ladsgroup.json [14:28:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance [14:28:29] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:28:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance [14:28:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T367856)', diff saved to https://phabricator.wikimedia.org/P69522 and previous config saved to /var/cache/conftool/dbconfig/20241009-142848-ladsgroup.json [14:29:15] !log ladsgroup@cumin1002 START - 
Cookbook sre.mysql.clone of db1198.eqiad.wmnet onto db1157.eqiad.wmnet [14:30:02] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:30:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet [14:31:13] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:31:26] ^ reboots, expected [14:31:44] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: sync [14:31:55] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [14:32:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:33:48] (03CR) 10JMeybohm: [C:03+2] Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078957 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm) [14:34:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet [14:34:34] (03PS2) 10Arturo Borrero Gonzalez: wmcs: declare prometheus::node_kernel_panic in profile::base::cloud_production [puppet] - 10https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) [14:34:39] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [14:35:13] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org [14:36:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet [14:39:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet [14:39:44] (03PS5) 10Herron: add links to SLOs migrated to pyrra [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1077966 (https://phabricator.wikimedia.org/T302995) [14:39:54] (03CR) 10Herron: [V:03+2 C:03+2] add links to SLOs migrated to pyrra [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1077966 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [14:42:13] (03PS1) 10Elukey: sre.hosts.provision: warn when the BMC firmware is old for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372) [14:42:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:39] (03CR) 10Cathal Mooney: [C:03+1] "Overall LGTM. One nit inline, not 100% essential. I'll work on setting the cloudsw side up to match the addresses used here." 
[puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [14:43:23] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb2004-dev [14:43:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudlb2004-dev [14:43:42] (03PS1) 10Muehlenhoff: Add jtweed to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/1078962 (https://phabricator.wikimedia.org/T376777) [14:44:05] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:44:25] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:44:41] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:44:53] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:44:55] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum [14:44:58] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [14:45:28] (03CR) 10Muehlenhoff: [C:03+2] Add jtweed to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/1078962 (https://phabricator.wikimedia.org/T376777) (owner: 10Muehlenhoff) [14:45:36] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[14:45:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [14:46:43] (03PS2) 10Elukey: sre.hosts.provision: warn when the BMC firmware is old for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372) [14:47:05] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:47:24] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:47:29] (03PS1) 10Scott French: echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) [14:47:30] !log brouberol@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling restart_daemons on P{cephosd1001*} and (A:cephosd) [14:47:43] (03CR) 10Elukey: "Tested on backup1012, worked nicely." [cookbooks] - 10https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:47:51] !log brouberol@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling restart_daemons on P{cephosd1001*} and (A:cephosd) [14:49:43] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for Jonathan Tweed - https://phabricator.wikimedia.org/T376777#10214623 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Hi Jonathan, I've added you to the cn=wmf LDAP group. You should be able to access the service... [14:50:13] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org [14:50:33] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:50:37] (03CR) 10Scott French: "Diffs look like what I'd expect :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [14:50:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet [14:51:07] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:51:21] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:51:35] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:51:47] RESOLVED: HelmReleaseBadStatus: Helm release kube-system/calico on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:52:00] FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:52:53] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on backup[2010-2011].codfw.wmnet with reason: T376800 [14:53:07] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on backup[2010-2011].codfw.wmnet with reason: T376800 [14:53:12] (03CR) 10Volans: sre.hosts.provision: warn when the BMC firmware is old for Supermicro (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:54:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet [14:54:34] (03CR) 10Hnowlan: 
[C:03+1] echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [14:55:20] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10214641 (10elukey) Last issue worth to report is T371416#10214548. The backup1012 host seems to have a very old fir... [14:57:51] (03CR) 10Elukey: [C:03+1] "In sessionstore I added also:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [14:58:57] !log brouberol@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling restart_daemons on P{cephosd1001*} and (A:cephosd) [14:59:21] !log brouberol@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling restart_daemons on P{cephosd1001*} and (A:cephosd) [15:00:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host crm2001.codfw.wmnet [15:01:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:36] (03CR) 10Elukey: sre.hosts.provision: warn when the BMC firmware is old for Supermicro (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:01:48] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [15:02:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr2-eqsin with reason: router replacement [15:03:04] (03PS1) 
10Klausman: modules/admin: add ml-lab-users to render group [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) [15:03:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr2-eqsin with reason: router replacement [15:03:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr2-eqsin with reason: router replacement [15:04:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr2-eqsin with reason: router replacement [15:04:19] (03PS5) 10Brouberol: wip: ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) [15:04:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host crm2001.codfw.wmnet [15:04:39] (03Abandoned) 10Hnowlan: trafficserver: route knowledge-gap path via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/946928 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [15:04:47] (03PS2) 10Klausman: modules/admin: add ml-lab-users to render group [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) [15:05:21] (03PS2) 10Scott French: echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) [15:05:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet [15:06:40] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org [15:06:58] (03CR) 10Scott French: "Excellent catch, Luca. So, it turns out it this _was_ using 8082 for both, which would have encountered the same bind error you observed. 
" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [15:07:03] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns1006.wikimedia.org [15:07:04] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns1006.wikimedia.org [15:07:04] (03CR) 10Elukey: "Please ping Infrastructure Foundations before adding new Posix groups :)" [puppet] - 10https://gerrit.wikimedia.org/r/1077914 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [15:07:07] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for Jonathan Tweed - https://phabricator.wikimedia.org/T376777#10214673 (10Aklapper) @MoritzMuehlenhoff: Per the steps on https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#LDAP_access , please also add new `ldap/wmf`... [15:09:04] (03PS1) 10Cwhite: opensearch: gate curator install [puppet] - 10https://gerrit.wikimedia.org/r/1078970 (https://phabricator.wikimedia.org/T362429) [15:09:10] (03CR) 10Elukey: "Technically this adds a SUDO rule so IIRC it should wait for the Infrastructure Foundations meeting that happens every Monday, in practice" [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [15:09:23] (03CR) 10Xcollazo: [C:03+1] dse-k8s-services: content_history: version bump image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [15:09:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet [15:09:59] !log people.wikimedia.org - rebooting backends [15:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:58] (03CR) 10CI reject: [V:04-1] opensearch: gate curator install [puppet] - 10https://gerrit.wikimedia.org/r/1078970 (https://phabricator.wikimedia.org/T362429) (owner: 10Cwhite) [15:11:40] (03PS6) 10Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) [15:12:30] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761#10214692 (10ssingh) @MSantos: Are you still the person responsible for approving these requests? If yes, this needs your approval. If not, apologies for adding you and please feel free to remove... [15:16:29] (03PS10) 10Arturo Borrero Gonzalez: cloudgw: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) [15:16:39] (03PS1) 10Muehlenhoff: Revert "IDP: Failover IDP service to eqiad." 
[dns] - 10https://gerrit.wikimedia.org/r/1078971 [15:16:45] (03CR) 10Alexandros Kosiaris: [C:03+1] Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078957 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm) [15:16:50] (03CR) 10Arturo Borrero Gonzalez: cloudgw: add IPv6 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [15:16:52] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [15:17:09] !log planet.wikimedia.org - rebooting backends [15:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:18] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10214744 (10Papaul) @ayounsi thanks for the feedback [15:17:35] (03CR) 10Arnaudb: [C:03+1] Revert "IDP: Failover IDP service to eqiad." [dns] - 10https://gerrit.wikimedia.org/r/1078971 (owner: 10Muehlenhoff) [15:17:51] (03CR) 10Muehlenhoff: [C:03+2] Revert "IDP: Failover IDP service to eqiad." [dns] - 10https://gerrit.wikimedia.org/r/1078971 (owner: 10Muehlenhoff) [15:17:53] (03CR) 10JHathaway: [C:03+1] Revert "IDP: Failover IDP service to eqiad." [dns] - 10https://gerrit.wikimedia.org/r/1078971 (owner: 10Muehlenhoff) [15:17:56] (03CR) 10Dzahn: [C:03+1] Revert "IDP: Failover IDP service to eqiad." [dns] - 10https://gerrit.wikimedia.org/r/1078971 (owner: 10Muehlenhoff) [15:17:56] (03CR) 10Jcrespo: [C:03+1] Revert "IDP: Failover IDP service to eqiad." 
[dns] - 10https://gerrit.wikimedia.org/r/1078971 (owner: 10Muehlenhoff) [15:17:57] (03PS1) 10Cathal Mooney: Add INCLUDEs for newly-assigned IPv6 networks WMCS Codfw [dns] - 10https://gerrit.wikimedia.org/r/1078972 (https://phabricator.wikimedia.org/T376462) [15:19:05] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10214754 (10Papaul) @cmooney thanks for the feedback for the migration let us work with the way it is setup for know and we can look into all t... [15:19:12] (03CR) 10CI reject: [V:04-1] Add INCLUDEs for newly-assigned IPv6 networks WMCS Codfw [dns] - 10https://gerrit.wikimedia.org/r/1078972 (https://phabricator.wikimedia.org/T376462) (owner: 10Cathal Mooney) [15:19:17] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:19:58] (03CR) 10Dzahn: "any ticket related to this?" [dns] - 10https://gerrit.wikimedia.org/r/1078946 (owner: 10Slyngshede) [15:20:53] !log running dummy authdns-update [15:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:40] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org [15:22:12] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add IPv6 reverse entry for cloudsw1-b1-codfw interface IPs - cmooney@cumin1002" [15:22:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add IPv6 reverse entry for cloudsw1-b1-codfw interface IPs - cmooney@cumin1002" [15:22:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:27] (03PS2) 10Cathal Mooney: Add INCLUDEs for newly-assigned IPv6 networks WMCS Codfw [dns] - 10https://gerrit.wikimedia.org/r/1078972 (https://phabricator.wikimedia.org/T376462) [15:23:34] !log stewards* - 
rebooting machines - T351202 [15:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:37] T351202: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202 [15:24:04] !log fabfur@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: eqsin cr replacementAA, T375961] [15:24:29] !log fabfur@cumin1002 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: depool site eqsin [reason: eqsin cr replacementAA, T375961] [15:24:35] !log fabfur@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: eqsin cr replacement, T375961] [15:24:38] !log fabfur@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: eqsin cr replacement, T375961] [15:25:12] (03CR) 10Cathal Mooney: [C:03+1] cloudgw: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [15:25:13] !log eqsin depooled for T375961 [15:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:01] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache idp.wikimedia.org on all recursors [15:26:05] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp.wikimedia.org on all recursors [15:26:25] (03CR) 10Volans: "I know is a WIP, but given our chat on IRC I got curious and left some comments." [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [15:26:42] (03PS3) 10Klausman: modules/admin: add ml-lab-users to render group [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) [15:27:22] (03CR) 10Klausman: "I have also tested just now: radeontop already works fine without elevated privileges. 
That is, once you're in the render group, you don't" [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [15:28:25] (03PS4) 10Klausman: modules/admin: add ml-lab-users to render group [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) [15:30:25] (03PS1) 10Bking: stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) [15:30:31] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org [15:30:57] (03CR) 10CI reject: [V:04-1] stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [15:31:02] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [15:33:45] (03PS2) 10Bking: stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) [15:34:15] (03CR) 10Scott French: "Alright, looking at the diffs once overriding `app.port`, there's a gotcha here: the prometheus port annotation is now incorrect - i.e., `" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [15:34:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [15:38:43] (03PS3) 10Bking: stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) [15:39:04] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [15:40:10] (03CR) 10Ssingh: [C:03+1] "Compared against Netbox and did the v6 PTR fun. 
Looks good 😊" [dns] - 10https://gerrit.wikimedia.org/r/1078972 (https://phabricator.wikimedia.org/T376462) (owner: 10Cathal Mooney) [15:40:43] (03PS2) 10Cwhite: opensearch: gate curator install [puppet] - 10https://gerrit.wikimedia.org/r/1078970 (https://phabricator.wikimedia.org/T362429) [15:41:47] (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDEs for newly-assigned IPv6 networks WMCS Codfw [dns] - 10https://gerrit.wikimedia.org/r/1078972 (https://phabricator.wikimedia.org/T376462) (owner: 10Cathal Mooney) [15:43:20] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and A:wikidough [15:43:35] (03PS7) 10Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) [15:43:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:43:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:43:54] FIRING: [2x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs5005 with peer 103.102.166.130 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [15:43:56] (03CR) 10Brouberol: Define a ceph rolling restart/reboot cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [15:44:01] oh ok [15:44:03] good that it paged [15:44:04] expected [15:44:08] !incidents [15:44:08] 5300 (RESOLVED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [15:44:08] 5302 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [15:44:09] 5304 (RESOLVED) ATSBackendErrorsHigh 
cache_upload sre (kartotherian.discovery.wmnet eqsin) [15:44:09] 5306 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service codfw) [15:44:09] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [15:44:09] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [15:44:09] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [15:44:10] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [15:44:18] !incidents [15:44:18] 5300 (RESOLVED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [15:44:18] 5302 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [15:44:18] 5304 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [15:44:19] 5306 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service codfw) [15:44:19] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [15:44:19] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [15:44:19] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [15:44:20] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [15:44:24] hmm [15:44:26] thanks, sukhe! [15:44:28] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reboot-single for host mx-in1001.wikimedia.org [15:44:31] hmmm ... that's odd [15:44:37] yeah [15:44:41] hasn't percolated to VO yet? [15:44:52] my pager still has not gone off, so no? 
[15:45:28] that's quite the lag [15:45:31] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2005.wikimedia.org [15:45:54] (03PS8) 10Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) [15:46:53] VO is still green to me [15:47:03] same here, and also on the web interface [15:47:12] this is not a false positive, it's an actual alert. so it's not even that [15:47:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:48:18] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx-in1001.wikimedia.org [15:48:23] !incidents [15:48:24] 5300 (RESOLVED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [15:48:24] 5302 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [15:48:25] 5304 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [15:48:25] 5306 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service codfw) [15:48:25] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [15:48:25] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [15:48:25] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [15:48:26] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [15:48:27] fun [15:48:32] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reboot-single for host mx-in2001.wikimedia.org [15:48:39] so yeah, I guess a ticket is in order [15:48:54] FIRING: [3x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs5004 with peer 103.102.166.130 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts -
https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [15:49:04] I am going to downtime this in the meanwhile [15:49:30] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs[5004-5006].eqsin.wmnet with reason: site is depooled, cr2-eqsin is being replaced [15:49:45] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs[5004-5006].eqsin.wmnet with reason: site is depooled, cr2-eqsin is being replaced [15:51:18] so yeah, no sign of this page on VO anywhere [15:51:59] karma is not logging errors [15:52:31] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx-in2001.wikimedia.org [15:52:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:53:17] !log running authdns-update [15:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:54:34] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.dns.roll-reboot (exit_code=97) rolling reboot on A:dnsbox [15:55:04] ^ resolving issues with authdns-update [15:57:11] VO status page is green (see -private for a related thing though :D ) [15:58:00] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns2005.wikimedia.org [15:58:00] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2005.wikimedia.org [16:00:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:01:30] (03CR) 10Btullis: stat hosts: enable zRAM-based swap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [16:02:51] (03CR) 10Btullis: stat 
hosts: enable zRAM-based swap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [16:03:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-misc2001.codfw.wmnet with OS bookworm [16:03:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-misc2002.codfw.wmnet with OS bookworm [16:03:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10214917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm [16:03:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10214918 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm [16:04:12] (03CR) 10Cathal Mooney: "Agreed this looks good to proceed. I'll wait until someone from Traffic can give the nod however." 
[dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) (owner: 10Cathal Mooney) [16:04:21] (03PS4) 10Cathal Mooney: Delegate IPv6 ranges allocated for WMCS Openstack networks in codfw [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) [16:04:54] (03CR) 10Scott French: "Looking at the current state of sessionstore prom metrics in staging, specifically at `up{kubernetes_namespace="sessionstore", prometheus=" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [16:05:09] (03CR) 10Volans: Define a ceph rolling restart/reboot cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [16:05:25] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [16:07:39] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:24] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [16:21:28] !log forcing commit 95858bae44a2ccae5e7fb1fe793cd3bbc7ed9c6b through sre.dns.netbox [16:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:55] (03PS4) 10Bking: stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) [16:23:06] (03CR) 10Bking: stat hosts: enable zRAM-based swap (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [16:23:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [16:23:27] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: picking up zone file 1.0.e.f.0.0.1.a.0.8.c.e.2.0.a.2.ip6.arpa - sukhe@cumin1002" [16:23:32] !log sukhe@cumin1002 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: picking up zone file 1.0.e.f.0.0.1.a.0.8.c.e.2.0.a.2.ip6.arpa - sukhe@cumin1002" [16:23:32] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:57] !log running authdns-update to fix broken zone files on dns2004 [16:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:49] (03PS3) 10Scott French: echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) [16:25:49] (03PS1) 10Scott French: kask: open app.port if mesh is enabled on another port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) [16:30:30] (03CR) 10Scott French: "Alright, I think this should do the job. I'll merge this and update sessionstore staging to confirm metrics come back, then merge the next" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [16:30:30] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: dns: integrate PTR support for 2a02:ec80:a100::/48 - https://phabricator.wikimedia.org/T376462#10214998 (10cmooney) The delegations for the 4 subnets used so far on the infra-side are working also: ` cmooney@cumin1002:~$... [16:31:16] (03CR) 10Scott French: "Alright, I think I have a solution, now stacked below this patch." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [16:32:01] !log starting requeueTranscodes on old school mwmaint2002 after the k8s blowup last night [16:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:25] (03PS5) 10Bking: stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) [16:32:49] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [16:32:58] (03PS2) 10Scott French: kask: open app.port if mesh is enabled on another port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) [16:32:59] (03PS4) 10Scott French: echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) [16:34:13] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:41:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1198.eqiad.wmnet onto db1157.eqiad.wmnet [16:44:38] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for codfw cr IPs facin cloudsw - cmooney@cumin1002" [16:44:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for codfw cr IPs facin cloudsw - cmooney@cumin1002" [16:44:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:44:46] (03PS1) 10Kosta Harlan: ipoid: Bump activeDeadlineSeconds to 24 hours [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078982 (https://phabricator.wikimedia.org/T374414) [16:46:04] (03CR) 10Máté Szabó: [C:03+2] ipoid: 
Bump activeDeadlineSeconds to 24 hours [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078982 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan) [16:47:06] (03Merged) 10jenkins-bot: ipoid: Bump activeDeadlineSeconds to 24 hours [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078982 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan) [16:48:09] (03PS3) 10Giuseppe Lavagetto: python_deploy::venv: transform into a define [puppet] - 10https://gerrit.wikimedia.org/r/1078707 [16:48:09] (03PS4) 10Giuseppe Lavagetto: fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782) [16:48:09] (03PS4) 10Giuseppe Lavagetto: profile::conftool: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078709 (https://phabricator.wikimedia.org/T371782) [16:48:09] (03PS1) 10Giuseppe Lavagetto: hiddenparma: add to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1078983 (https://phabricator.wikimedia.org/T371782) [16:48:11] (03PS1) 10Giuseppe Lavagetto: acme_chief: add SAN for requestctl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1078984 (https://phabricator.wikimedia.org/T371782) [16:48:13] (03PS1) 10Giuseppe Lavagetto: role::alerting_host: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) [16:48:26] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [16:48:29] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [16:50:05] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [16:50:09] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [16:50:12] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [16:50:24] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply 
[16:50:46] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [16:50:48] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [16:54:46] (03PS1) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) [16:55:19] (03CR) 10CI reject: [V:04-1] p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: 10David Caro) [16:58:37] (03CR) 10Elukey: [C:03+1] "Not 100% straightforward to read but it does the job, I'd test it to see how it works!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [16:58:58] (03PS2) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) [16:59:37] (03CR) 10CI reject: [V:04-1] p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: 10David Caro) [16:59:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P69523 and previous config saved to /var/cache/conftool/dbconfig/20241009-165944-ladsgroup.json [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1700) [17:00:47] (03PS3) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) [17:01:02] (03CR) 10Elukey: [C:03+2] "Had a chat with Riccardo on meet, we are going to proceed with this one for the moment." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [17:02:00] (03PS4) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) [17:03:53] (03CR) 10CI reject: [V:04-1] p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: 10David Caro) [17:04:22] (03PS5) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) [17:04:29] !incidents [17:04:29] 5307 (UNACKED) Host cr2-eqsin - PING - Packet loss = 100% [17:04:29] 5300 (RESOLVED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [17:04:30] 5302 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [17:04:30] 5304 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [17:04:30] 5306 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service codfw) [17:04:30] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [17:04:30] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [17:04:31] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [17:04:31] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [17:04:42] !ack 5307 [17:04:43] 5307 (ACKED) Host cr2-eqsin - PING - Packet loss = 100% [17:04:47] what's happening to alerts today? [17:04:50] sukhe: expected? 
[17:04:58] or is this just wildly old [17:04:58] earlier no VO alert now we got the alert via email but no here [17:05:24] (03PS6) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) [17:07:04] swfrench-wmf: expected yep [17:07:09] site is depooled [17:07:29] but this was downtimed [17:07:37] I think it might have expired? [17:07:59] yep I thought it was four but was two hours: https://phabricator.wikimedia.org/T375961#10214665 [17:08:00] I see the downtime was added at 15:04 UTC [17:08:34] I think we can ACK it but it will be up soon so might be worthwhile to not downtime it again [17:08:41] papaul and robh are working on it [17:08:55] (03PS1) 10RLazarus: deployment_server: Add `helm list` pagination to mwscript-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078988 (https://phabricator.wikimedia.org/T376795) [17:08:57] sounds good - thanks! [17:09:01] thanks and sorry for the noise [17:09:28] on that note, I will file a task for why we didn't get paged for the pybal alert [17:12:27] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host alert2002.wikimedia.org [17:12:28] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2002.wikimedia.org [17:13:02] (03PS3) 10JHathaway: vrts_aliases: add a basic safeguard, improve existing safeguards [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) [17:13:09] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host alert1002.wikimedia.org [17:13:22] (03CR) 10David Caro: [V:03+1] "Tested in toolsbeta: https://api.beta.toolforge.org" [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: 10David Caro) [17:13:47] (03CR) 10JHathaway: "Getting back to this, cut another patch with a few improvements, please take a look, thanks" [puppet] - 
10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) (owner: 10JHathaway) [17:14:56] (03CR) 10CI reject: [V:04-1] vrts_aliases: add a basic safeguard, improve existing safeguards [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) (owner: 10JHathaway) [17:15:12] (03CR) 10Scott French: [C:03+1] deployment_server: Add `helm list` pagination to mwscript-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078988 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus) [17:15:52] (03PS4) 10JHathaway: vrts_aliases: add a basic safeguard, improve existing safeguards [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) [17:17:01] (03PS1) 10Cathal Mooney: Add elements for WMCS IPv6 range in codfw 2a02:ec80:a100::/48 [homer/public] - 10https://gerrit.wikimedia.org/r/1078990 (https://phabricator.wikimedia.org/T245495) [17:18:02] bvibber: I'm deleting the artifacts from those video transcode jobs on kubernetes -- just FYI the logs won't be around for seven days as promised :) is that okay or do you need to collect anything first? 
[17:20:09] (03CR) 10RLazarus: [C:03+2] deployment_server: Add `helm list` pagination to mwscript-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078988 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus) [17:21:41] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host alert1002.wikimedia.org [17:23:00] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host alert2002.wikimedia.org [17:23:01] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2002.wikimedia.org [17:23:10] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host alert1002.wikimedia.org [17:23:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215165 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm executed with errors: - mc-misc20... [17:23:43] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm executed with errors: - mc-misc20... 
[17:26:44] (03CR) 10Hnowlan: [C:03+1] kask: open app.port if mesh is enabled on another port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [17:27:34] (03CR) 10Hnowlan: [C:03+1] echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [17:29:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P69525 and previous config saved to /var/cache/conftool/dbconfig/20241009-172956-ladsgroup.json [17:31:42] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host alert1002.wikimedia.org [17:32:07] PROBLEM - Host cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [17:32:20] ^ expected [17:33:13] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:34:16] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet [17:34:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:35:24] (03CR) 10Scott French: "Thank you both for the reviews!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [17:35:28] (03CR) 10Scott French: [C:03+2] kask: open app.port if mesh is enabled on another port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [17:36:30] (03Merged) 10jenkins-bot: kask: open app.port if mesh is enabled on another port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [17:37:49] (03Abandoned) 10Andrea Denisse: alert: Failover from alert2002 to alert1002 [puppet] - 10https://gerrit.wikimedia.org/r/1064828 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [17:38:17] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet [17:40:58] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [17:41:36] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [17:41:50] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [17:42:29] (03CR) 10Ssingh: [C:03+1] "No concerns I can see: PTRs look good and I don't think there is any concern with the delegation (also not the first one to ns[01].opensta" [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) (owner: 10Cathal Mooney) [17:44:56] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet [17:45:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P69526 and previous config saved to /var/cache/conftool/dbconfig/20241009-174501-ladsgroup.json [17:45:55] (03CR) 10Scott French: "Thank you both!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [17:45:57] (03CR) 10Scott French: [C:03+2] echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [17:46:40] rzl: that's fine, i don't need the output :D [17:46:56] thanks! [17:47:00] (03Merged) 10jenkins-bot: echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [17:47:16] luv 2 break the site overnight \o/ [17:48:26] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:49:44] (03PS1) 10Ssingh: varnish: add pediapress.com to allowed maps domains [puppet] - 10https://gerrit.wikimedia.org/r/1078994 [17:50:27] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4261/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078994 (owner: 10Ssingh) [17:50:51] (03PS2) 10Ssingh: varnish: add pediapress.com to allowed maps domains [puppet] - 10https://gerrit.wikimedia.org/r/1078994 (https://phabricator.wikimedia.org/T375761) [17:51:33] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [17:51:44] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4262/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078994 (https://phabricator.wikimedia.org/T375761) (owner: 10Ssingh) [17:51:46] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [17:53:54] !log jhancock@cumin2002 
START - Cookbook sre.hosts.reimage for host mc-misc2001.codfw.wmnet with OS bookworm [17:53:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-misc2002.codfw.wmnet with OS bookworm [17:53:59] (03CR) 10Ssingh: [V:03+1] "This should not be merged until the request has been approved in the ticket above but the patch exists." [puppet] - 10https://gerrit.wikimedia.org/r/1078994 (https://phabricator.wikimedia.org/T375761) (owner: 10Ssingh) [17:54:00] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215245 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm [17:54:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215246 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm [17:58:19] !log zabe@mwmaint2002:~$ cat /home/zabe/s5.txt | xargs -I{} bash -c "echo {}; mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php {} --skip /home/zabe/text_table_cleanup/{} --dump /home/zabe/text_table_dump/{} --sleep 1" # T183490 [17:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:22] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [17:58:28] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:01:44] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [18:01:54] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [18:03:08] RECOVERY - Host cr2-eqsin.mgmt is UP: 
PING OK - Packet loss = 0%, RTA = 222.67 ms [18:03:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:04:09] FIRING: HelmReleaseBadStatus: Helm release echostore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=echostore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:06:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-misc2001.codfw.wmnet with reason: host reimage [18:07:50] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [18:08:58] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus7001.magru.wmnet [18:09:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-misc2001.codfw.wmnet with reason: host reimage [18:10:49] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:10:53] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:10:56] RECOVERY - Host 
cr2-eqsin is UP: PING OK - Packet loss = 0%, RTA = 238.56 ms [18:11:04] yayay [18:11:23] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:11:27] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:11:46] nice! :) [18:12:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-misc2002.codfw.wmnet with reason: host reimage [18:13:33] RECOVERY - Host cr2-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 223.83 ms [18:13:57] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:14:45] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:14:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-misc2002.codfw.wmnet with reason: host reimage [18:15:08] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus7001.magru.wmnet [18:15:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:15:32] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs[5004-5006].eqsin.wmnet [18:15:34] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs[5004-5006].eqsin.wmnet [18:15:57] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:16:01] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet [18:16:57] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
prometheus6002.drmrs.wmnet [18:18:20] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus5002.eqsin.wmnet [18:18:21] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:18:59] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:21:01] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:23:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:24:30] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:eqsin and A:dnsbox [18:24:30] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org [18:24:45] (03PS1) 10Scott French: Revert "echostore: pilot service mesh support in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079000 (https://phabricator.wikimedia.org/T376766) [18:25:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10215329 (10phaultfinder) [18:26:05] (03CR) 10Scott French: [C:03+2] Revert "echostore: pilot service mesh support in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079000 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [18:26:08] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:26:19] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:26:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool for reclone (T375652)', diff saved to 
https://phabricator.wikimedia.org/P69527 and previous config saved to /var/cache/conftool/dbconfig/20241009-182632-ladsgroup.json [18:26:35] T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry - https://phabricator.wikimedia.org/T375652 [18:26:43] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host prometheus5002.eqsin.wmnet [18:26:52] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db1198.eqiad.wmnet onto db1212.eqiad.wmnet [18:27:03] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:27:12] (03Merged) 10jenkins-bot: Revert "echostore: pilot service mesh support in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079000 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [18:28:05] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:28:15] ^ expected [18:28:18] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus4002.ulsfo.wmnet [18:28:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [18:29:05] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [18:29:30] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [18:29:47] PROBLEM - MariaDB Replica IO: s3 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1212.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on 
db1212.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:33:43] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [18:33:45] (03CR) 10RLazarus: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553) (owner: 10Lucas Werkmeister (WMDE)) [18:33:45] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns5003 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:34:07] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:34:09] RESOLVED: HelmReleaseBadStatus: Helm release echostore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=echostore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:34:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:34:21] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4002.ulsfo.wmnet [18:34:53] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org [18:35:28] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet [18:35:38] PROBLEM - Check if ntpsec.service has been restarted after 
/etc/ntpsec/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:37:01] this one is interesting [18:37:53] PROBLEM - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:38:07] PROBLEM - MariaDB Replica Lag: s3 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:38:11] PROBLEM - MariaDB Replica Lag: s3 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:38:13] PROBLEM - MariaDB Replica Lag: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:38:50] !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts2002.codfw.wmnet [18:38:57] PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns3003 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:38:59] PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:39:42] ACKNOWLEDGEMENT - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns3003 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). Sukhbir Singh ntpsec needs a restarts - The acknowledgement expires at: 2024-10-10 15:00:00. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:39:42] ACKNOWLEDGEMENT - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). Sukhbir Singh ntpsec needs a restarts - The acknowledgement expires at: 2024-10-10 15:00:00. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:39:42] ACKNOWLEDGEMENT - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). Sukhbir Singh ntpsec needs a restarts - The acknowledgement expires at: 2024-10-10 15:00:00. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:40:11] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:40:39] ^ the above is the alert1001 and 2001 removals. 
will issue a restart later [18:41:36] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3003.esams.wmnet [18:45:20] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2002.codfw.wmnet [18:47:45] FIRING: ProbeDown: Service install5002:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:49:53] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5004.wikimedia.org [18:51:15] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:51:26] RESOLVED: ProbeDown: Service install5002:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:52:30] FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:54:15] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:00:15] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:01:03] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns5004 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [19:01:17] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:04:23] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5004.wikimedia.org [19:04:23] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:eqsin and A:dnsbox [19:04:40] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and A:dnsbox [19:04:40] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org [19:05:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:08:21] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:08:27] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:08:31] ^ expected [19:08:45] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:09:19] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:12:21] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:12:49] 10ops-eqsin: Inbound interface errors - https://phabricator.wikimedia.org/T376837 (10phaultfinder) 03NEW [19:13:27] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:13:45] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:13:55] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7001 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [19:14:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:15:21] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:16:23] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:17:19] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:20:23] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org [19:20:25] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:21:29] ^ this is doh5001 flapping. 
I will wait for maintenance to settle down and then we can check this [19:24:25] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:27:25] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:27:30] !log mforns@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [19:27:43] !log mforns@deploy2002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [19:27:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:27:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-misc2001.codfw.wmnet with OS bookworm [19:28:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:28:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-misc2002.codfw.wmnet with OS bookworm [19:28:07] ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215504 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm completed: - mc-misc2001 (**PASS*... [19:28:09] ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215505 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm completed: - mc-misc2002 (**WARN*... 
[19:29:05] ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215506 (Jhancock.wm) [19:31:27] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:32:19] (PS1) Scott French: Revert^2 "echostore: pilot service mesh support in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/1079005 (https://phabricator.wikimedia.org/T376766) [19:32:49] ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215511 (Jhancock.wm) Open→Resolved @jijiki this is ready for you. [19:34:59] !log mforns@deploy2002 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply [19:35:18] !log mforns@deploy2002 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply [19:35:23] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org [19:36:23] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:37:29] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:38:11] !log mforns@deploy2002 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply [19:38:27] !log mforns@deploy2002 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply [19:38:29] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:38:53] (PS1) Ladsgroup: mariadb: Add SLAVE MONITOR to promotheus grants [puppet] - https://gerrit.wikimedia.org/r/1079006 [19:39:05] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:39:15] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:39:23] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:43:18] (CR) Ladsgroup: "This was what was missing for pc5, I'm not sure whether we should do it or not but documenting it at least." [puppet] - https://gerrit.wikimedia.org/r/1079006 (owner: Ladsgroup) [19:43:29] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:44:51] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7002 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [19:45:05] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:45:15] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:45:29] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:46:00] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org [19:46:00] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and A:dnsbox [19:49:29] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:54:25] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:55:34] !log removing 
echostore staging deployment to unblock breaking change - T376766 [19:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:37] T376766: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766 [19:55:44] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns2006*} and A:dnsbox [19:55:44] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org [19:56:29] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:57:00] ^ there is sadly no way to silence this. I suspect the BFD session flapping is due to the really old version of Junos on the new cr2-eqsin cr [19:57:09] so we will wait for the Junos upgrade and then revisit this [19:57:25] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:59:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:59:09] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:59:17] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:59:25] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:01:29] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:02:29] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:03:05] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 369, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:03:09] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 287, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:03:17] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns2006 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [20:03:18] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:03:25] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:06:41] (CR) Scott French: [C:+2] Revert^2 "echostore: pilot service mesh support in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/1079005 (https://phabricator.wikimedia.org/T376766) (owner: Scott French) [20:07:39] (Merged) jenkins-bot: Revert^2 "echostore: pilot service mesh support in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/1079005 (https://phabricator.wikimedia.org/T376766) (owner: Scott French) [20:08:22] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org [20:08:22] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns2006*} and A:dnsbox [20:12:08] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [20:12:25] !log swfrench@deploy2002 helmfile [staging] 
DONE helmfile.d/services/echostore: apply [20:17:31] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and (A:esams or A:drmrs) and A:dnsbox [20:17:31] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org [20:21:27] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:21:45] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:24:47] RECOVERY - MariaDB Replica IO: s3 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:27:27] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:27:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:27:37] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns3003 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [20:27:45] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:28:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:32:41] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org [20:33:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:34:53] RECOVERY - MariaDB Replica Lag: s3 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:35:07] RECOVERY - MariaDB Replica Lag: s3 on db1154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:35:11] RECOVERY - MariaDB Replica Lag: s3 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:35:13] RECOVERY - MariaDB Replica Lag: s3 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:35:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:38:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1198.eqiad.wmnet onto db1212.eqiad.wmnet [20:41:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:42:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:44:41] !log sukhe@cumin1002 
cookbooks.sre.dns.roll-reboot begin reboot of dns3004.wikimedia.org [20:46:23] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.54 ms [20:48:11] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:48:23] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:50:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:54:27] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:54:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:54:47] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns3004 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [20:55:13] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:55:23] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:56:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P69528 and previous config saved to /var/cache/conftool/dbconfig/20241009-205601-ladsgroup.json [20:56:07] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3004.wikimedia.org [20:57:27] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:57:37] PROBLEM - NTP peers and stratum check on dns3004 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown, stratum=-1 (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP [20:58:17] RECOVERY - NTP peers and stratum check on dns3004 is OK: NTP OK: Offset -0.000921227 secs, stratum=2 https://wikitech.wikimedia.org/wiki/NTP [20:58:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T2100) [21:01:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:04:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:07:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:08:07] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin 
reboot of dns6001.wikimedia.org [21:10:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:11:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P69529 and previous config saved to /var/cache/conftool/dbconfig/20241009-211107-ladsgroup.json [21:12:05] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:12:07] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:15:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:16:53] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6001 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:17:05] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:17:07] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:18:03] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org [21:18:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:19:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:22:45] !log [apt1002:~] $ sudo -i reprepro --component thirdparty/gitlab-bullseye update bullseye-wikimedia [21:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:26:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P69530 and previous config saved to /var/cache/conftool/dbconfig/20241009-212612-ladsgroup.json [21:28:38] (PS3) Scott French: echostore: adopt service mesh in production [deployment-charts] - https://gerrit.wikimedia.org/r/1079012 (https://phabricator.wikimedia.org/T376766) [21:30:03] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns6002.wikimedia.org [21:32:32] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009 [21:32:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:33:23] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP 
CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:34:15] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:34:39] (CR) Cwhite: [C:+2] opensearch: gate curator install [puppet] - https://gerrit.wikimedia.org/r/1078970 (https://phabricator.wikimedia.org/T362429) (owner: Cwhite) [21:35:37] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6002 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:37:20] dzahn@cumin2002 dzahn: The backup on gitlab1003 is complete, ready to proceed with upgrade. [21:38:15] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:38:23] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:41:07] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009 [21:41:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P69531 and previous config saved to /var/cache/conftool/dbconfig/20241009-214117-ladsgroup.json [21:42:30] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009 [21:42:36] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009 [21:44:07] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009 [21:44:14] 
!log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009 [21:44:57] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009 [21:45:16] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6002.wikimedia.org [21:45:16] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and (A:esams or A:drmrs) and A:dnsbox [21:45:42] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009 [21:47:56] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009-2 [21:48:41] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009-2 [21:51:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:53:13] (PS1) Bking: data-platform: alert on load15 > 32 [alerts] - https://gerrit.wikimedia.org/r/1079021 (https://phabricator.wikimedia.org/T376426) [21:53:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:54:15] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009-2 [21:55:04] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009-2 [21:55:12] (CR) Ryan Kemper: [C:+1] "We've discussed lowering this threshold with the hardcoded value of 32 for now and then in a future patch making it actually look at the n" [alerts] - 
https://gerrit.wikimedia.org/r/1079021 (https://phabricator.wikimedia.org/T376426) (owner: Bking) [21:57:03] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: release 20241009-3 [21:57:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:57:48] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: release 20241009-3 [21:59:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:00:24] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: release 20241009-3 [22:01:44] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: release 20241009-3 [22:03:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:07:50] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [22:09:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:09:40] (CR) Dzahn: "When using the cookbook today, it failed:" [cookbooks] - 
https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: Jelto) [22:10:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:10:50] (PS1) Dzahn: Revert "sre.gitlab.upgrade: also use the service name for the downtime" [cookbooks] - https://gerrit.wikimedia.org/r/1079025 [22:10:57] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:11:11] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:11:27] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:11:55] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:12:06] (CR) Dzahn: "reverting to see if it's gone - we need to upgrade" [cookbooks] - https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: Jelto) [22:12:28] (CR) EoghanGaffney: [C:+1] Revert "sre.gitlab.upgrade: also use the service name for the downtime" [cookbooks] - https://gerrit.wikimedia.org/r/1079025 (owner: Dzahn) [22:12:46] (CR) Dzahn: [C:+2] "we need it" [cookbooks] - https://gerrit.wikimedia.org/r/1079025 (owner: Dzahn) [22:13:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:14:11] (CR) Bking: [C:+2] data-platform: alert on load15 > 32 [alerts] - https://gerrit.wikimedia.org/r/1079021 (https://phabricator.wikimedia.org/T376426) (owner: Bking) [22:15:21] (Merged) jenkins-bot: data-platform: alert on load15 > 32 
[alerts] - https://gerrit.wikimedia.org/r/1079021 (https://phabricator.wikimedia.org/T376426) (owner: Bking) [22:20:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:21:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:24:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:25:24] (PS1) Dzahn: gerrit: comment out creation of site dir in migration profile [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) [22:25:57] (CR) CI reject: [V:-1] gerrit: comment out creation of site dir in migration profile [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn) [22:26:20] (PS2) Dzahn: gerrit: comment out creation of site dir in migration profile [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) [22:26:52] (CR) CI reject: [V:-1] gerrit: comment out creation of site dir in migration profile [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn) [22:28:13] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20241009-3 [22:28:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:28:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [22:28:58] !log 
dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: security release 20241009-3 [22:29:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:29:35] (PS3) Dzahn: gerrit: comment out creation of site dir in migration profile [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) [22:29:51] (CR) Dzahn: [V:+1 C:+2] "unexpectedly also did not fix it. now going here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1079026" [puppet] - https://gerrit.wikimedia.org/r/1078759 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn) [22:30:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10215833 (phaultfinder) [22:30:56] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20241009-3 [22:33:43] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [22:35:11] (CR) Dzahn: [C:+2] "yea, after the revert was deployed the cookbook works again." 
[cookbooks] - https://gerrit.wikimedia.org/r/1079025 (owner: Dzahn) [22:35:57] (CR) Dzahn: [C:+2] "was:" [cookbooks] - https://gerrit.wikimedia.org/r/1079025 (owner: Dzahn) [22:35:59] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: security release 20241009-3 [22:36:49] (CR) Dzahn: [V:+1 C:+1] "https://puppet-compiler.wmflabs.org/output/1079026/4264/" [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn) [22:37:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:42:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:43:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:47:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:50:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool for reclone (T375652)', diff saved to https://phabricator.wikimedia.org/P69532 and previous config saved to /var/cache/conftool/dbconfig/20241009-225055-ladsgroup.json [22:50:58] T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry - https://phabricator.wikimedia.org/T375652 [22:51:15] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db1198.eqiad.wmnet onto db1223.eqiad.wmnet [22:51:55] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:51:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1012.eqiad.wmnet with OS bookworm [22:52:08] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10215864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm compl... [22:52:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:52:30] FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:53:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:55:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:56:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:58:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:01:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:02:28] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009 [23:05:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:06:26] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:06:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:07:30] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009 [23:07:45] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:08:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:09:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:10:11] (03CR) 10Scott French: "Thanks for this! No objections to this in the abstract, but I do want to understand the underlying motivation a bit better." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [23:13:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:18:27] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus2008.codfw.wmnet [23:18:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:19:06] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1079026/4264/" [puppet] - 10https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [23:20:31] PROBLEM - Host logging-hd2001 is DOWN: PING CRITICAL - Packet loss = 100% [23:21:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10215958 (10Jclark-ctr) [23:21:59] RECOVERY - Host logging-hd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [23:22:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10215955 (10Jclark-ctr) 05Open→03Resolved a:05Marostegui→03Jclark-ctr [23:22:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:25:48] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2008.codfw.wmnet [23:26:23] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet [23:27:13] PROBLEM - Host logging-hd2002 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:35] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop on prod hosts confirmed. 
fixed puppet run on gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [23:29:11] (03CR) 10Dzahn: "this is now unblocked since puppet is unbroken on the new machine and then installed rsync and other things. deploying tomorrow or soon th" [puppet] - 10https://gerrit.wikimedia.org/r/1078752 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [23:29:19] RECOVERY - Host logging-hd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [23:31:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:34:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:36:35] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:38:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1079035 [23:38:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1079035 (owner: 10TrainBranchBot) [23:41:28] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host prometheus1005.eqiad.wmnet [23:41:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:41:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:43:56] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet [23:44:31] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:44:40] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10215980 (10Dzahn) For the first time puppet runs just fine on the new hardware now, before it is in production. Also gerrit is deployed there already. Everything is in plac... [23:46:09] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10215987 (10Dzahn) Notably this also means **gerrit on bookworm ** seems to work. Since no more puppet issues, app deployed, same Java version. [23:49:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit2003.wikimedia.org [23:51:17] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet [23:52:14] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [23:52:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:53:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status