[00:10:50] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:38:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1119887 [00:38:02] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1119887 (owner: 10TrainBranchBot) [00:47:54] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1119887 (owner: 10TrainBranchBot) [00:51:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [00:56:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [01:08:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1119888 [01:08:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1119888 (owner: 10TrainBranchBot) [01:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:25:20] PROBLEM - Disk space on 
grafana2001 is CRITICAL: DISK CRITICAL - free space: / 349MiB (2% inode=32%): /tmp 349MiB (2% inode=32%): /var/tmp 349MiB (2% inode=32%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [01:29:42] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1119888 (owner: 10TrainBranchBot) [01:35:26] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:50:46] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 167232392 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:51:46] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 896 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:05:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 347MiB (2% inode=32%): /tmp 347MiB (2% inode=32%): /var/tmp 347MiB (2% inode=32%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [02:09:52] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:12:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:42] FIRING: 
JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:45:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [02:45:53] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [02:46:02] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:47:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:52:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:35:20] PROBLEM - 
Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 345MiB (2% inode=32%): /tmp 345MiB (2% inode=32%): /var/tmp 345MiB (2% inode=32%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [03:55:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [04:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10555974 (10phaultfinder) [05:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:24:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10556008 (10phaultfinder) [05:35:26] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:12:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:45:54] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet 
- https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [06:46:07] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:59:32] FIRING: [2x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:02:20] RESOLVED: [2x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:14:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10556066 (10phaultfinder) [07:17:28] (03CR) 10Arnaudb: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1119718 (https://phabricator.wikimedia.org/T386297) (owner: 10Jelto) [08:00:05] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250217T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:06:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:07:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:09:12] (03CR) 10Ilias Sarantopoulos: [C:03+2] knserve-inference: add seccompProfile to the pod security context [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117939 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [08:11:56] (03Merged) 10jenkins-bot: knserve-inference: add seccompProfile to the pod security context [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117939 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [08:15:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 337MiB (2% inode=32%): /tmp 337MiB (2% inode=32%): /var/tmp 337MiB (2% inode=32%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [08:35:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [08:38:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:43:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:45:32] PROBLEM - MariaDB Replica SQL: s8 on db2200 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table wbt_property_terms is corrupt: try to repair it on query. Default database: wikidatawiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:47:25] (03CR) 10JMeybohm: [C:03+2] admin_ng: Switch on enableJobSidecarController for toolhub [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119231 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [08:47:37] (03CR) 10JMeybohm: [C:03+1] toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [08:48:01] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [08:48:30] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [08:49:04] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [08:50:48] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:51:13] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:51:22] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [08:52:13] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:53:33] (03PS1) 10Filippo Giunchedi: udp2log: don't bail on invalid utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1120130 (https://phabricator.wikimedia.org/T386421) [08:53:34] PROBLEM - MariaDB Replica Lag: s8 on db2200 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:54:34] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[08:56:01] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Feb-Mar): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10556190 (10Nikerabbit) p:05Medium→03High [08:56:29] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [08:56:38] (03PS3) 10Andrew Bogott: openstack: puppet: Drop support for .wmflabs names [puppet] - 10https://gerrit.wikimedia.org/r/1095193 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [08:57:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:58:42] (03CR) 10Majavah: [C:03+2] openstack: puppet: Drop support for .wmflabs names [puppet] - 10https://gerrit.wikimedia.org/r/1095193 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [09:00:58] (03CR) 10Filippo Giunchedi: [C:03+1] sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [09:01:16] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:03:25] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:03:33] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:03:40] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[09:03:46] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:03:52] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:03:57] (03CR) 10JMeybohm: [C:03+1] sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [09:04:00] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:04:08] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:05:11] (03CR) 10Arnaudb: [C:03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1120130 (https://phabricator.wikimedia.org/T386421) (owner: 10Filippo Giunchedi) [09:05:55] (03CR) 10Filippo Giunchedi: [C:03+2] query_service: clean up icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1114381 (https://phabricator.wikimedia.org/T358029) (owner: 10Filippo Giunchedi) [09:06:28] (03CR) 10Filippo Giunchedi: [C:03+2] udp2log: don't bail on invalid utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1120130 (https://phabricator.wikimedia.org/T386421) (owner: 10Filippo Giunchedi) [09:07:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:09:26] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
[09:09:49] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [09:10:09] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:10:34] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:16:48] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10556218 (10JMeybohm) From lunch discussion in Atlanta: It would be ideal if we could create a recording rule that... [09:22:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:24:32] RESOLVED: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:31:46] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10556225 (10fgiunchedi) Since we have to overwrite `instance` with the host instead of the router, that information... 
[09:35:26] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:13] (03PS1) 10Brouberol: opensearch: include the minor version in the apt component name [puppet] - 10https://gerrit.wikimedia.org/r/1120140 (https://phabricator.wikimedia.org/T380752) [09:42:03] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4939/co" [puppet] - 10https://gerrit.wikimedia.org/r/1120140 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [09:51:14] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4940/co" [puppet] - 10https://gerrit.wikimedia.org/r/1120140 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [10:12:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:21:24] (03CR) 10FNegri: [C:03+2] toolsdb_apt_pinning: enable manual 10.6 upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1119473 (https://phabricator.wikimedia.org/T385885) (owner: 10FNegri) [10:34:49] (03PS1) 10Majavah: wikitech: Unset $wgEnableCreativeCommonsRdf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120149 [10:45:58] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [10:46:07] FIRING: PuppetFailure: Puppet 
has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:55:52] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:59:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118485 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [10:59:56] (03CR) 10Lucas Werkmeister (WMDE): "Though I’m still waiting for confirmation from Product / ComCom, so please don’t deploy until I confirm :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118485 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250217T1100) [11:05:32] PROBLEM - OpenSearch health check for shards on 9200 on relforge1004 is CRITICAL: Return code of 141 is out of bounds https://wikitech.wikimedia.org/wiki/Search%23Administration [11:25:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10556710 (10phaultfinder) [11:29:20] 06SRE, 06Traffic: Define an event stream and schema for haproxy_requestctl analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10556727 (10Fabfur) 05Open→03Resolved [11:38:04] (03PS5) 10Federico Ceratto: clone.py, clone_test.py: Implement full DB cloning runbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 [11:45:19] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Implement full DB cloning runbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 (owner: 10Federico 
Ceratto) [11:47:09] (03PS1) 10Aklapper: E_STRICT PHP constant deprecated since PHP 8.4 [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120155 [11:47:18] (03PS1) 10Elukey: Revert^2 "admin_ng: enforce restricted PSS on ml-staging-codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120156 [11:51:38] (03CR) 10Ilias Sarantopoulos: [C:03+1] Revert^2 "admin_ng: enforce restricted PSS on ml-staging-codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120156 (owner: 10Elukey) [11:53:18] (03PS1) 10Aklapper: E_STRICT PHP constant deprecated since PHP 8.4 [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120158 [11:54:33] (03CR) 10Elukey: [C:03+2] Revert^2 "admin_ng: enforce restricted PSS on ml-staging-codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120156 (owner: 10Elukey) [11:55:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:55:48] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:58:37] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:59:31] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:05:35] (03PS2) 10Elukey: admin_ng: disable PSP binding for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115323 (https://phabricator.wikimedia.org/T369493) [14:47:14] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:49:02] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119874|Suppress login audit hook in local leg of SUL3 authentication (T385574 T385572)]] (duration: 27m 43s) [14:49:07] T385574: LoginNotify and SUL3: Two emails are sent for a new device login when logging in on a SUL3 wiki - https://phabricator.wikimedia.org/T385574 [14:49:08] T385572: SUL3: CheckUser is told about two successful logins with different IP addresses for one successful login on a SUL3 enabled wiki - https://phabricator.wikimedia.org/T385572 [14:49:13] Lucas_WMDE: all yours [14:49:16] thanks! [14:49:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118485 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [14:49:59] (03PS1) 10Federico Ceratto: Reformat with Black [cookbooks] - 10https://gerrit.wikimedia.org/r/1120187 [14:50:08] (03Merged) 10jenkins-bot: Enable fixed Wikibase RDF on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118485 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [14:50:14] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-k8s-ssl_6543: Servers maps1007.eqiad.wmnet are marked down but pooled: kartotherian-ssl_443: Servers maps1009.eqiad.wmnet, maps1010.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:50:24] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1118485|Enable fixed Wikibase RDF on Test Wikidata (T384344)]] 
[14:50:28] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [14:51:35] (03Abandoned) 10Federico Ceratto: Reformat with Black [cookbooks] - 10https://gerrit.wikimedia.org/r/1120187 (owner: 10Federico Ceratto) [14:52:45] (03PS2) 10Federico Ceratto: Reformat with Black [cookbooks] - 10https://gerrit.wikimedia.org/r/1118098 [14:54:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:54:37] (03CR) 10Federico Ceratto: "Updated using 120 as line len." [cookbooks] - 10https://gerrit.wikimedia.org/r/1118098 (owner: 10Federico Ceratto) [14:54:52] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1118485|Enable fixed Wikibase RDF on Test Wikidata (T384344)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:54:57] (03PS3) 10Federico Ceratto: Reformat with Black [cookbooks] - 10https://gerrit.wikimedia.org/r/1118098 [14:54:57] testing… [14:55:45] looks good – https://test.wikidata.org/wiki/Special:EntityData/Q469.ttl?flavor=dump exhibits the expected diff, https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl?flavor=dump doesn’t change yet [14:55:52] (03CR) 10Federico Ceratto: [C:03+1] Reformat with Black [cookbooks] - 10https://gerrit.wikimedia.org/r/1118098 (owner: 10Federico Ceratto) [14:55:52] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:56:26] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps2006.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [14:56:41] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [14:56:52] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: 
name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:56:58] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps1006.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:58:14] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:59:17] (03CR) 10Ladsgroup: [C:03+2] Reformat with Black [cookbooks] - 10https://gerrit.wikimedia.org/r/1118098 (owner: 10Federico Ceratto) [15:02:49] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118485|Enable fixed Wikibase RDF on Test Wikidata (T384344)]] (duration: 12m 24s) [15:02:52] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [15:06:14] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1009.eqiad.wmnet, maps1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:06:49] (03CR) 10Ladsgroup: "not sure how it works but you can set it to WIP (look at three dots at the top of the page). 
You can also remove us and then add us back 😄" [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 (owner: 10Federico Ceratto) [15:07:14] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:07:39] !log UTC afternoon backport+config window done [15:07:40] * Lucas_WMDE done deploying [15:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:08:15] (03CR) 10Majavah: [C:03+2] wiki replicas: Drop special case handling for Wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1120174 (owner: 10Majavah) [15:10:13] (03PS1) 10Filippo Giunchedi: pontoon: fix puppetmaster lookup in enc [puppet] - 10https://gerrit.wikimedia.org/r/1120188 [15:11:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:12:14] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:13:09] !log restart all kartotherian services on maps1* - high unavailability [15:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:14] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:18:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Monday, February 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119740 (https://phabricator.wikimedia.org/T383801) (owner: 10Clare Ming) [15:19:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:29:35] (03CR) 10Tiziano Fogli: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1120188 (owner: 10Filippo Giunchedi) [15:29:51] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: fix puppetmaster lookup in enc [puppet] - 10https://gerrit.wikimedia.org/r/1120188 (owner: 10Filippo Giunchedi) [15:31:19] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [15:35:48] (03PS1) 10JMeybohm: cert-manager: Allow prometheus to scrape all components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120193 (https://phabricator.wikimedia.org/T341984) [15:36:01] (03PS2) 10JMeybohm: cert-manager: Allow prometheus to scrape all components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120193 (https://phabricator.wikimedia.org/T341984) [15:41:26] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet [15:42:25] FIRING: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:36] ^^^ am aware, working on it [15:52:25] RESOLVED: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:25] FIRING: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:58:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119739 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [15:58:25] RESOLVED: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:58:47] (03Merged) 10jenkins-bot: docroot: Add experimental assetlinks.json from and to various domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119739 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [15:59:03] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1119739|docroot: Add experimental assetlinks.json from and to various domains (T385520)]] [15:59:13] T385520: Investigate seamless credential sharing - https://phabricator.wikimedia.org/T385520 [15:59:24] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on relforge1004.eqiad.wmnet with reason: T380752 [15:59:27] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [16:00:21] (03PS7) 10LD: frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) [16:02:50] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#10557396 (10cmooney) >>! 
In T366193#9851085, @cmooney wrote: > - Most major resolvers/dns providers appear to be 'smart' and pick the lowest-latency server (as per [[ https://datatracker.ietf.org/doc/html/rfc4697#section-2.... [16:02:55] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1119739|docroot: Add experimental assetlinks.json from and to various domains (T385520)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:04:16] !log krinkle@deploy2002 krinkle: Continuing with sync [16:05:36] (03PS8) 10LD: frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) [16:06:58] (03PS9) 10LD: frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) [16:11:56] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119739|docroot: Add experimental assetlinks.json from and to various domains (T385520)]] (duration: 12m 53s) [16:12:00] T385520: Investigate seamless credential sharing - https://phabricator.wikimedia.org/T385520 [16:12:12] (03PS10) 10LD: frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) [16:21:01] (03PS11) 10LD: frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) [16:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10557427 (10phaultfinder) [16:27:11] (03CR) 10Jforrester: "Oh, sorry, missed in the rename. Thank you!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1119831 (https://phabricator.wikimedia.org/T380807) (owner: 10Alexandros Kosiaris) [16:27:47] (03PS12) 10LD: frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) [16:30:05] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250217T1630). [16:44:32] FIRING: SystemdUnitFailed: ferm.service on kubestage2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:41] !log jayme@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubestage2003.codfw.wmnet [16:49:17] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2003.codfw.wmnet [16:52:21] RESOLVED: SystemdUnitFailed: ferm.service on kubestage2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:57:36] !log removed systemd override for haproxykafka on cp4037 (T378758) [16:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:39] T378758: Set CPU affinity for haproxykafka process - https://phabricator.wikimedia.org/T378758 [17:00:55] (03CR) 10Elukey: [C:03+2] conftool: add more k8s nodes for Kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/1120186 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [17:25:02] !log zabe@mwmaint2002:~$ cat /home/zabe/group2.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php {} --delete /home/zabe/text_table_cleanup/{} --sleep 0.3" # 
T183490 [17:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:06] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [17:29:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10557557 (10phaultfinder) [17:37:20] (03CR) 10Pcoombe: [C:03+1] "This is also a blocker for being able to solve T364348" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [17:55:39] (03CR) 10Pppery: "Now somebody needs to get the attention of SRE. Neither the author nor the approver of the patch this is reverting have responded, nor did" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [17:56:29] (03CR) 10Pppery: "Correct link since Gerrit doesn't let you edit comments: https://wikitech.wikimedia.org/wiki/Puppet_request_window" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250217T1800) [18:00:04] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250217T1800). 
[18:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:24:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10557645 (phaultfinder)
[19:19:21] (PS1) Krinkle: mediawiki: Add rewrite rule to fix serving of /.well-known static files [puppet] - https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520)
[19:19:43] (CR) CI reject: [V:-1] mediawiki: Add rewrite rule to fix serving of /.well-known static files [puppet] - https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: Krinkle)
[19:23:06] (PS1) WMDE-Fisch: [beta] Change sub-referencing feature flag to new name [mediawiki-config] - https://gerrit.wikimedia.org/r/1120217 (https://phabricator.wikimedia.org/T373307)
[19:37:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[19:47:54] (PS1) Gergő Tisza: auth: Log actual error message for action=login [core] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120220
[19:48:03] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120220 (owner: Gergő Tisza)
[19:52:55] (PS2) Krinkle: mediawiki: Add rewrite rule to fix serving of /.well-known static files
[puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) [19:53:02] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [19:53:10] (03CR) 10Gergő Tisza: mediawiki: Add rewrite rule to fix serving of /.well-known static files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [19:53:16] (03PS3) 10Krinkle: mediawiki: Add rewrite rule to fix serving of /.well-known static files [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) [19:53:32] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [19:54:55] (03CR) 10Krinkle: "https://puppet-compiler.wmflabs.org/output/1120216/2971/mwdebug1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [19:55:19] (03CR) 10Gergő Tisza: mediawiki: Add rewrite rule to fix serving of /.well-known static files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [19:55:28] (03CR) 10Krinkle: mediawiki: Add rewrite rule to fix serving of /.well-known static files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [19:56:12] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:56:27] (03PS4) 10Krinkle: mediawiki: Add rewrite rule to fix serving of /.well-known static files [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) [19:58:16] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:58:37] !log 
dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:58:56] (03CR) 10CI reject: [V:04-1] auth: Log actual error message for action=login [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120220 (owner: 10Gergő Tisza) [19:58:58] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:59:47] (03CR) 10Krinkle: mediawiki: Add rewrite rule to fix serving of /.well-known static files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [20:00:40] (03CR) 10Gergő Tisza: [C:03+1] mediawiki: Add rewrite rule to fix serving of /.well-known static files [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [20:00:45] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [20:01:55] (03CR) 10Gergő Tisza: "recheck" [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120220 (owner: 10Gergő Tisza) [20:02:38] (03CR) 10Pppery: [C:03+1] ncmonitor: Ignore wikipediacreators.com [puppet] - 10https://gerrit.wikimedia.org/r/1115996 (owner: 10BCornwall) [20:05:45] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [20:07:20] FIRING: 
CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:07:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [20:08:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:16:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:17:08] (03PS2) 10Michael Große: Growth: increase minimum tasks per topic for 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120224 (https://phabricator.wikimedia.org/T386248) [20:21:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:21:58] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119879 (https://phabricator.wikimedia.org/T386561) (owner: Pppery)
[20:22:21] (CR) Urbanecm: [C:+1] "lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120224 (https://phabricator.wikimedia.org/T386248) (owner: Michael Große)
[20:24:32] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120224 (https://phabricator.wikimedia.org/T386248) (owner: Michael Große)
[20:24:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10557820 (phaultfinder)
[20:28:50] (PS1) Michael Große: beta(Growth): enable Surfacing Add Link also on beta-cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1120225
[20:29:26] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120225 (owner: Michael Große)
[20:38:43] (PS1) Fabfur: hiera: add haproxy dummy ring configuration everywhere [puppet] - https://gerrit.wikimedia.org/r/1120228 (https://phabricator.wikimedia.org/T329332)
[20:41:09] (CR) Fabfur: "Didn't notice I0adb0b9a2aee1a52bd2ec52336fe583b40b88762 so probably this is useless now, WDYT @cdanis@wikimedia.org?"
[puppet] - 10https://gerrit.wikimedia.org/r/1120228 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [20:42:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:43:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:47:55] (03CR) 10Urbanecm: [C:03+2] beta(Growth): enable Surfacing Add Link also on beta-cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120225 (owner: 10Michael Große) [20:48:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120225 (owner: 10Michael Große) [20:48:38] (03Merged) 10jenkins-bot: beta(Growth): enable Surfacing Add Link also on beta-cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120225 (owner: 10Michael Große) [20:49:12] (03PS3) 10Michael Große: Growth: increase minimum tasks per topic for 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120224 (https://phabricator.wikimedia.org/T386248) [20:49:15] (03CR) 10Urbanecm: [C:03+2] Growth: increase minimum tasks per topic for 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120224 (https://phabricator.wikimedia.org/T386248) (owner: 10Michael Große) [20:49:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120224 (https://phabricator.wikimedia.org/T386248) (owner: 10Michael Große) [20:49:56] MichaelG_WMF: i went ahead and deployed those two a bit earlier, as one is beta-only and the other one affects only the maint script [20:49:57] (03Merged) 10jenkins-bot: Growth: increase minimum tasks per topic for 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120224 (https://phabricator.wikimedia.org/T386248) (owner: 10Michael Große) [20:50:15] urbanecm: Thanks! [20:50:17] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1120224|Growth: increase minimum tasks per topic for 4 more wikis (T386248)]] [20:50:20] T386248: Bump the number of Add Link tasks per topic to 2000 for all Surfacing structured tasks pilot wikis - https://phabricator.wikimedia.org/T386248 [20:52:10] urbanecm: I just remembered that the beta-change was pointless. [20:52:18] what was wrong with it? [20:52:57] It has no variant. And we (I) merged the code checking for the variant... [20:53:56] there's no harm done, and that change is the prerequisite to making it available on beta-cswiki, but in itself, that change was effectively a no-op [20:54:50] well, it allows people to self-enroll though? [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250217T2100). [21:00:05] cjming, tgr, Pppery, and MichaelG_WMF: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] i can deploy today [21:00:20] here [21:00:26] hey! 
[21:00:30] (PS4) Pppery: Restrict unfuzzy on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1119879 (https://phabricator.wikimedia.org/T386561)
[21:00:30] hi urbanecm: you sure? i'm also happy to
[21:00:32] (CR) Urbanecm: [C:+2] Restrict unfuzzy on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1119879 (https://phabricator.wikimedia.org/T386561) (owner: Pppery)
[21:00:50] cjming: unless you really want to do it yourself :))
[21:01:22] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120224|Growth: increase minimum tasks per topic for 4 more wikis (T386248)]] (duration: 11m 05s)
[21:01:24] happy to have you deploy - "really want" doesn't quite capture it
[21:01:25] o/
[21:01:26] T386248: Bump the number of Add Link tasks per topic to 2000 for all Surfacing structured tasks pilot wikis - https://phabricator.wikimedia.org/T386248
[21:01:28] (PS2) Clare Ming: Re-enable test experiment for testwiki for upcoming demos [mediawiki-config] - https://gerrit.wikimedia.org/r/1119740 (https://phabricator.wikimedia.org/T383801)
[21:01:31] (CR) Urbanecm: [C:+2] Re-enable test experiment for testwiki for upcoming demos [mediawiki-config] - https://gerrit.wikimedia.org/r/1119740 (https://phabricator.wikimedia.org/T383801) (owner: Clare Ming)
[21:02:03] (CR) Urbanecm: [C:+2] auth: Log actual error message for action=login [core] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120220 (owner: Gergő Tisza)
[21:02:06] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119879 (https://phabricator.wikimedia.org/T386561) (owner: Pppery)
[21:02:07] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119740 (https://phabricator.wikimedia.org/T383801) (owner: Clare Ming)
[21:02:07] (Merged) jenkins-bot:
Restrict unfuzzy on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119879 (https://phabricator.wikimedia.org/T386561) (owner: 10Pppery) [21:02:18] (03Merged) 10jenkins-bot: Re-enable test experiment for testwiki for upcoming demos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119740 (https://phabricator.wikimedia.org/T383801) (owner: 10Clare Ming) [21:02:38] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1119879|Restrict unfuzzy on Commons (T386561)]], [[gerrit:1119740|Re-enable test experiment for testwiki for upcoming demos (T383801)]] [21:02:43] T386561: Restrict unfuzzy on Commons - https://phabricator.wikimedia.org/T386561 [21:02:43] T383801: Remove Experimentation Lab's first test experiment - https://phabricator.wikimedia.org/T383801 [21:06:07] !log urbanecm@deploy2002 urbanecm, pppery, cjming: Backport for [[gerrit:1119879|Restrict unfuzzy on Commons (T386561)]], [[gerrit:1119740|Re-enable test experiment for testwiki for upcoming demos (T383801)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:12] Checked Special:ListGroupRights on Commons, looks as expected [21:07:16] ty [21:07:19] cjming: what about you? [21:07:32] lgtm! please sync [21:07:34] !log urbanecm@deploy2002 urbanecm, pppery, cjming: Continuing with sync [21:07:43] proceeding [21:08:39] ty! 
[21:09:26] PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 44469MiB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[21:10:01] I'll add one more patch to the window
[21:10:12] can self-deploy once the rest is done
[21:10:30] (PS1) Gergő Tisza: Lower log level of SUL3 start/end events [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120231 (https://phabricator.wikimedia.org/T377261)
[21:10:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120231 (https://phabricator.wikimedia.org/T377261) (owner: Gergő Tisza)
[21:13:25] (Merged) jenkins-bot: auth: Log actual error message for action=login [core] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120220 (owner: Gergő Tisza)
[21:14:38] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119879|Restrict unfuzzy on Commons (T386561)]], [[gerrit:1119740|Re-enable test experiment for testwiki for upcoming demos (T383801)]] (duration: 11m 59s)
[21:14:46] T386561: Restrict unfuzzy on Commons - https://phabricator.wikimedia.org/T386561
[21:14:46] T383801: Remove Experimentation Lab's first test experiment - https://phabricator.wikimedia.org/T383801
[21:18:31] tgr|away: feel free to take over
[21:19:13] urbanecm: do you want to deploy the config changes first?
[21:19:40] tgr|away: all config changes in the calendar are deployed?
[21:20:04] if there is any other config change, happy to, but i think i got all
[21:21:11] there are two Growth patches
[21:21:17] well, one for production
[21:21:36] was that deployed before the window?
[21:21:42] tgr|away: both of them are deployed, i started a bit early
[21:21:51] oh, ok then
[21:22:02] I'll take over then, thanks
[21:22:05] np
[21:22:15] (CR) Gergő Tisza: [C:+2] Lower log level of SUL3 start/end events [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120231 (https://phabricator.wikimedia.org/T377261) (owner: Gergő Tisza)
[21:30:42] (CR) Southparkfan: "Can we get this patch forward?" [puppet] - https://gerrit.wikimedia.org/r/1070563 (https://phabricator.wikimedia.org/T373702) (owner: Slyngshede)
[21:31:10] (Merged) jenkins-bot: Lower log level of SUL3 start/end events [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120231 (https://phabricator.wikimedia.org/T377261) (owner: Gergő Tisza)
[21:32:35] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1120220|auth: Log actual error message for action=login]], [[gerrit:1120231|Lower log level of SUL3 start/end events (T377261)]]
[21:32:38] T377261: Track the number of interrupted SUL3 logins / signups - https://phabricator.wikimedia.org/T377261
[21:35:18] !log tgr@deploy2002 tgr: Backport for [[gerrit:1120220|auth: Log actual error message for action=login]], [[gerrit:1120231|Lower log level of SUL3 start/end events (T377261)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:39:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:41:50] !log tgr@deploy2002 tgr: Continuing with sync
[21:44:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:48:25] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120220|auth: Log actual error message for action=login]], [[gerrit:1120231|Lower log level of SUL3 start/end events (T377261)]] (duration: 15m 50s) [21:48:29] T377261: Track the number of interrupted SUL3 logins / signups - https://phabricator.wikimedia.org/T377261 [21:50:15] !log UTC late deploys done [21:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.371s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:57:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.371s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:00:05] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250217T2200). 
[22:09:26] RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[22:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:54:56] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:55:56] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:19:26] PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 45802MiB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops