[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T0000)
[00:14:00] (PS1) Zabe: Increase revision-slots cache expiry back to default for more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118592
[00:17:43] "NOTE: often skipped, the web team does not typically check IRC so assume this is not being used if 5 minutes past the start "
[00:17:55] (CR) Zabe: [C:+2] Increase revision-slots cache expiry back to default for more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118592 (owner: Zabe)
[00:18:36] (Merged) jenkins-bot: Increase revision-slots cache expiry back to default for more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118592 (owner: Zabe)
[00:19:56] (PS1) Zabe: Add script to delete obsolote text table rows [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.15) - https://gerrit.wikimedia.org/r/1118594
[00:20:03] (CR) Zabe: [C:+2] Add script to delete obsolote text table rows [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.15) - https://gerrit.wikimedia.org/r/1118594 (owner: Zabe)
[00:22:36] (Merged) jenkins-bot: Add script to delete obsolote text table rows [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.15) - https://gerrit.wikimedia.org/r/1118594 (owner: Zabe)
[00:23:51] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1118592|Increase revision-slots cache expiry back to default for more wikis]], [[gerrit:1118594|Add script to delete obsolote text table rows]]
[00:26:37] !log zabe@deploy2002 zabe: Backport for [[gerrit:1118592|Increase revision-slots cache expiry back to default for more wikis]], [[gerrit:1118594|Add script to delete obsolote text table rows]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:26:47] !log zabe@deploy2002 zabe: Continuing with sync
[00:33:33] !log zabe@deploy2002 Finished scap sync-world: Backport for
[[gerrit:1118592|Increase revision-slots cache expiry back to default for more wikis]], [[gerrit:1118594|Add script to delete obsolote text table rows]] (duration: 09m 41s)
[00:38:14] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1118595
[00:38:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1118595 (owner: TrainBranchBot)
[00:48:16] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1118595 (owner: TrainBranchBot)
[00:59:06] !log zabe@mwmaint2002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php test2wiki --delete /home/zabe/text_table_cleanup/test2wiki # T183490
[00:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:09] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[00:59:44] !log bd808@mwmaint2002 Wikitech: Renamed and attached 128 accounts claimed via Striker/Bitu after usurping their SUL name (T161859)
[00:59:46] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859
[01:05:21] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 497MiB (3% inode=33%): /tmp 497MiB (3% inode=33%): /var/tmp 497MiB (3% inode=33%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[01:08:16] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1118596
[01:08:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1118596 (owner: TrainBranchBot)
[01:29:04] (Merged) jenkins-bot:
Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1118596 (owner: TrainBranchBot)
[01:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:45:21] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[02:04:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10537562 (phaultfinder)
[02:08:22] (PS1) TrainBranchBot: Branch commit for wmf/1.44.0-wmf.16 [core] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1118598 (https://phabricator.wikimedia.org/T382367)
[02:08:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.44.0-wmf.16 [core] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1118598 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[02:14:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[02:18:22] (Merged) jenkins-bot: Branch commit for wmf/1.44.0-wmf.16 [core] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1118598 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[02:37:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:49] FIRING: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed -
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:59:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10537588 (phaultfinder)
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T0300)
[03:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10537621 (phaultfinder)
[03:32:41] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[03:36:00] (PS1) Daimona Eaytoy: Drop obsolete CampaignEvents config flags [mediawiki-config] - https://gerrit.wikimedia.org/r/1118604 (https://phabricator.wikimedia.org/T380076)
[03:36:54] (PS3) Daimona Eaytoy: core-Permissions: drop redundant CampaignEvents right assignments [mediawiki-config] - https://gerrit.wikimedia.org/r/1116834 (https://phabricator.wikimedia.org/T376822)
[03:37:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118604 (https://phabricator.wikimedia.org/T380076) (owner: Daimona Eaytoy)
[03:38:25] (CR) ScheduleDeploymentBot: "Scheduled for
deployment in the [Tuesday, February 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1116834 (https://phabricator.wikimedia.org/T376822) (owner: Daimona Eaytoy)
[03:56:35] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1118498 (owner: L10n-bot)
[04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T0400)
[04:02:04] (PS1) TrainBranchBot: testwikis to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1118605 (https://phabricator.wikimedia.org/T382367)
[04:02:05] (CR) TrainBranchBot: [C:+2] testwikis to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1118605 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[04:02:49] (Merged) jenkins-bot: testwikis to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1118605 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[04:03:18] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.16 refs T382367
[04:03:21] T382367: 1.44.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T382367
[04:28:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[04:33:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook -
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[04:49:47] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.16 refs T382367 (duration: 46m 29s)
[04:49:50] T382367: 1.44.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T382367
[04:56:00] SRE-swift-storage, CX-deployments, MinT, LPL Essential (LPL Essential 2025 Feb-Mar): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10537677 (KartikMistry)
[05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T0500)
[05:03:04] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.13 (duration: 03m 02s)
[05:09:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10537685 (phaultfinder)
[05:17:09] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:39:45] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10537704 (phaultfinder)
[05:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[06:32:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very
high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:37:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:39:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[06:42:49] FIRING: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T0700)
[07:00:05] marostegui, Amir1, and federico3: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T0700). Please do the needful.
[07:14:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:19:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10537756 (phaultfinder)
[07:19:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:27:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:32:41] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[07:58:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 77108232 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:00:05] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon
thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:00:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 133752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:04:54] !log restarting blazegraph on wdqs2016 to mitigate free allocator decreasing
[08:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:13] inflatador, ryankemper, dcausse : ^
[08:12:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:19:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:20:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:00:04] andre and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T0900).
[09:00:09] o/
[09:02:52] (PS1) TrainBranchBot: group0 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1118770 (https://phabricator.wikimedia.org/T382367)
[09:02:53] (CR) TrainBranchBot: [C:+2] group0 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1118770 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[09:03:42] (Merged) jenkins-bot: group0 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1118770 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[09:20:44] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.16 refs T382367
[09:20:47] T382367: 1.44.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T382367
[09:28:44] (PS3) Arthur taylor: Remove `tmpEnableMulLanguageCode` setting [mediawiki-config] - https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217)
[09:39:02] (CR) Gehel: [C:+2] Update the SSH key for btullis with a new public key [puppet] - https://gerrit.wikimedia.org/r/1118520 (https://phabricator.wikimedia.org/T385943) (owner: Btullis)
[09:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:45] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2025.02.10 - 2025.02.28): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10537993 (Gehel)
[10:04:34] ops-eqiad, DC-Ops, Data-Platform-SRE (2025.02.10 - 2025.02.28), Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10538001 (Gehel)
[10:04:44] ops-eqiad, SRE, DC-Ops, decommission-hardware, Data-Platform-SRE (2025.02.10 -
2025.02.28): decommission cloudelastic100[5-6] - https://phabricator.wikimedia.org/T380937#10538011 (Gehel)
[10:15:27] (PS1) Btullis: Restore groups to btullis account [puppet] - https://gerrit.wikimedia.org/r/1118778 (https://phabricator.wikimedia.org/T385943)
[10:24:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[10:28:21] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:28:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:28:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10538240 (phaultfinder)
[10:30:25] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:39:56] (PS3) Urbanecm: [Growth] Deploy Community updates to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118557 (https://phabricator.wikimedia.org/T384406)
[10:40:18] (CR) Brouberol: [C:+1] Restore groups to btullis account [puppet] - https://gerrit.wikimedia.org/r/1118778 (https://phabricator.wikimedia.org/T385943) (owner: Btullis)
[10:42:49] FIRING: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:44:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[10:50:11] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.108 second response
time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:50:15] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:50:25] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:52:15] SRE, SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10538292 (Urbanecm) >>! In T385808#10532691, @KFrancis wrote: > Hi all, I checked my records, and I do have have a NDA for anything under kwa.schultz@gmail.com or user EP1C. What is this pe...
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T1100)
[11:13:02] (PS2) Lucas Werkmeister (WMDE): Enable fixed Wikibase RDF on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1118484 (https://phabricator.wikimedia.org/T384344)
[11:13:02] (PS2) Lucas Werkmeister (WMDE): Enable fixed Wikibase RDF on Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1118485 (https://phabricator.wikimedia.org/T384344)
[11:13:02] (PS2) Lucas Werkmeister (WMDE): Enable fixed Wikibase RDF everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344)
[11:13:03] (PS2) Lucas Werkmeister (WMDE): Remove Wikibase fixed RDF feature flag again [mediawiki-config] - https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344)
[11:18:46] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118484 (https://phabricator.wikimedia.org/T384344) (owner:
Lucas Werkmeister (WMDE))
[11:22:48] (CR) Klausman: [C:+2] admin_ng: lower memory limitranges for revision-models namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1118122 (owner: AikoChou)
[11:23:21] ops-eqiad, cloud-services-team, Cloud-VPS, DC-Ops: [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 on cloudvirt1047 - https://phabricator.wikimedia.org/T386083 (Andrew) NEW
[11:26:53] (Merged) jenkins-bot: admin_ng: lower memory limitranges for revision-models namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1118122 (owner: AikoChou)
[11:27:04] (PS1) Gmodena: cirrus: deploy new mlr models [mediawiki-config] - https://gerrit.wikimedia.org/r/1118782 (https://phabricator.wikimedia.org/T385972)
[11:30:19] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[11:30:50] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[11:31:35] (PS1) Gmodena: cirrus: create buckets for mlr 2025 experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1118783 (https://phabricator.wikimedia.org/T385972)
[11:32:41] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[11:32:55] (CR) Stevemunene: [C:+1] Restore groups to btullis account [puppet] - https://gerrit.wikimedia.org/r/1118778 (https://phabricator.wikimedia.org/T385943) (owner: Btullis)
[11:39:05] (PS1) Brouberol: airflow: add kafka connections to configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676)
[11:40:55] ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: [Hardware Error]:
Hardware error from APEI Generic Hardware Error Source: 4 on cloudvirt1047 - https://phabricator.wikimedia.org/T386083#10538485 (fnegri) > It has been corrected by h/w and requires no further action Does this mean we s...
[11:42:31] (PS1) Gmodena: cirrus: enable mlr-2024 for select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972)
[11:43:11] (PS2) Gmodena: cirrus: enable mlr-2025 for select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972)
[11:44:08] (PS3) Gmodena: cirrus: enable mlr-2025 for select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972)
[11:47:30] (PS3) Gmodena: cirrus: enable mlr-2025 for select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972)
[11:47:55] (PS1) Gmodena: cirrus: create buckets for mlr 2025 experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1118783 (https://phabricator.wikimedia.org/T385972)
[12:00:54] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[12:01:44] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[12:28:58] (CR) Btullis: [C:+1] "Looks good. One minor question, but not blocking."
[deployment-charts] - https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) (owner: Brouberol)
[12:34:48] (CR) Urbanecm: [C:+2] [Growth] Deploy Community updates to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118557 (https://phabricator.wikimedia.org/T384406) (owner: Urbanecm)
[12:35:08] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118557 (https://phabricator.wikimedia.org/T384406) (owner: Urbanecm)
[12:35:38] (Merged) jenkins-bot: [Growth] Deploy Community updates to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118557 (https://phabricator.wikimedia.org/T384406) (owner: Urbanecm)
[12:36:35] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1118557|[Growth] Deploy Community updates to all wikis (T384406)]]
[12:36:38] T384406: Community Updates module: Release to all Wikipedias with GrowthExperiments - https://phabricator.wikimedia.org/T384406
[12:40:49] (CR) Andrew Bogott: [C:+2] Add wmcs-bastionless utility script [puppet] - https://gerrit.wikimedia.org/r/1118526 (https://phabricator.wikimedia.org/T379550) (owner: Andrew Bogott)
[12:41:11] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1118557|[Growth] Deploy Community updates to all wikis (T384406)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:41:38] !log urbanecm@deploy2002 urbanecm: Continuing with sync
[12:41:39] (CR) Stevemunene: [C:+2] Restore groups to btullis account [puppet] - https://gerrit.wikimedia.org/r/1118778 (https://phabricator.wikimedia.org/T385943) (owner: Btullis)
[12:44:06] ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 on cloudvirt1047 - https://phabricator.wikimedia.org/T386083#10538613 (Andrew) >>!
In T386083#10538485, @fnegri wrote: >> It has been corrected by h/w and requir...
[12:46:35] ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 on cloudvirt1047 - https://phabricator.wikimedia.org/T386083#10538626 (Andrew) Browsing stack overflow implies that this is likely an impending HW issue
[12:47:49] RESOLVED: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:48:04] FIRING: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:48:18] RESOLVED: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:48:28] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118557|[Growth] Deploy Community updates to all wikis (T384406)]] (duration: 11m 52s)
[12:48:32] T384406: Community Updates module: Release to all Wikipedias with GrowthExperiments - https://phabricator.wikimedia.org/T384406
[12:48:36] * urbanecm done
[12:50:24] FIRING: [2x] ProbeDown: Service ml-serve-ctrl1002:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:51:15] <_joe_> klausman: ^^
[12:51:31] on it
[12:51:44] <_joe_> !incidents
[12:51:45] 5670 (UNACKED) [2x] ProbeDown sre (ml-serve-ctrl1002:6443 probes/custom eqiad)
[12:51:51] <_joe_> !ack
5670
[12:51:51] 5670 (ACKED) [2x] ProbeDown sre (ml-serve-ctrl1002:6443 probes/custom eqiad)
[12:52:33] <_joe_> (sorry, we're at our offsite and I'm rehearsing my presentation, but ping if you need help, most people should be waking up now)
[12:52:39] o/
[12:52:58] i'm around if needed
[12:53:12] klausman: you need a hand?
[12:53:41] Nah, all good
[12:53:56] <_joe_> <3
[12:53:57] The alert has stopped firing it seems. Currently trying to figure out more detail
[12:55:24] RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl1002:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:55:49] (CR) Brouberol: airflow: add kafka connections to configuration (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) (owner: Brouberol)
[12:56:00] ok
[12:56:05] Looks like the kube-api-server systemd service restarted at around 12:50 UTC
[12:56:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:57:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:57:39] (PS1) CDanis: aux-k8s-eqiad: add RBD-backed persistence [deployment-charts] - https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541)
[12:57:39] klausman: that happens regularly when certificates are refreshed (on that apiserver or one of the others)
[12:57:43] https://phabricator.wikimedia.org/P73432 AFAICT the service failed to talk to etcd, and then either exited or restarted itself
[12:58:15] grrrr, journalctl insisting on not wrapping lines ...
[12:58:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:58:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:58:59] klausman: systemctl status kube-apiserver-safe-restart.service [12:59:18] (03PS2) 10Brouberol: airflow: add kafka connections to configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) [12:59:34] that triggered the restart and it's what confd calls when refreshing certs from etcd [12:59:36] (03CR) 10Brouberol: airflow: add kafka connections to configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T1300) [13:00:09] aha, so the probedown was just unlucky timing [13:00:23] we've seen in wikikube that starting apiserver took longer than the alert threshold at some point (load, number of objects in etcd, ...) [13:00:44] it seems related to knative thought [13:00:47] *though [13:00:53] (03PS1) 10Andrew Bogott: Add alerting for cloud-vps users outside of the Bastion project [alerts] - 10https://gerrit.wikimedia.org/r/1118798 (https://phabricator.wikimedia.org/T379550) [13:00:58] morning luca :) [13:01:03] morning :) [13:01:07] updated the paste to include long lines [13:01:13] how's it related to knative? [13:02:05] I meant that the timeout to etcd seems related to knative trying to check some status [13:02:18] (03CR) 10Brouberol: airflow: add kafka connections to configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [13:02:28] ah, because of the timeout in the logs? I would not be so sure without checking ... 
it might be something that happens from time to time [13:02:28] (03CR) 10CI reject: [V:04-1] Add alerting for cloud-vps users outside of the Bastion project [alerts] - 10https://gerrit.wikimedia.org/r/1118798 (https://phabricator.wikimedia.org/T379550) (owner: 10Andrew Bogott) [13:02:43] okok! [13:02:58] Also, note that some of the errors are "timeout _or abort_", so maybe they came from the shutdown sequence of the k-api-server [13:03:04] as well maybe due to hitting a limit of etcd on ganeti and/or apiserver on ganeti [13:03:28] (same for canceled context) [13:03:43] could be yes, let's keep an eye on future occurrences though [13:03:50] aye [13:04:10] Now go and enjoy breakfast ;) [13:04:30] klausman: scroll back the logs I'd say - there might/should be other restarts still in there [13:05:08] and maybe go check performance metrics of etcd, apiserver and the underlying nodes over the last weeks [13:07:14] (03CR) 10Btullis: aux-k8s-eqiad: add RBD-backed persistence (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [13:07:48] jayme: will do [13:15:45] (03PS1) 10CDanis: ceph: add aux-k8s-eqiad-csi user [puppet] - 10https://gerrit.wikimedia.org/r/1118805 (https://phabricator.wikimedia.org/T380541) [13:17:14] (03PS1) 10CDanis: ceph: add aux-k8s-eqiad-csi user [labs/private] - 10https://gerrit.wikimedia.org/r/1118806 (https://phabricator.wikimedia.org/T380541) [13:17:55] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cr2-magru with reason: IBGP instability from cr1 to cr2 in magru causing ping failures from alert1002 [13:18:31] (03PS2) 10CDanis: aux-k8s-eqiad: add RBD-backed persistence [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) [13:21:08] !log adding BGP trace/verbose debug mode on magru CRs as advised by Juniper support T384774 [13:21:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:11] T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774 [13:21:22] (03CR) 10Andrew Bogott: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/1118798 (https://phabricator.wikimedia.org/T379550) (owner: 10Andrew Bogott) [13:32:52] (03CR) 10DCausse: "lgtm, small nit regarding the name of the test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [13:38:51] (03CR) 10Brouberol: [C:03+1] Update plugins to opensearch 1.3.20 [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1118553 (https://phabricator.wikimedia.org/T385005) (owner: 10DCausse) [13:39:25] (03PS2) 10CDanis: ceph: add aux-k8s-eqiad-csi user [labs/private] - 10https://gerrit.wikimedia.org/r/1118806 (https://phabricator.wikimedia.org/T380541) [13:39:44] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Drop obsolete CampaignEvents config flags (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118604 (https://phabricator.wikimedia.org/T380076) (owner: 10Daimona Eaytoy) [13:39:46] (03PS3) 10CDanis: ceph: add aux-k8s-csi-rbd user [labs/private] - 10https://gerrit.wikimedia.org/r/1118806 (https://phabricator.wikimedia.org/T380541) [13:40:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:40:52] (03PS3) 10CDanis: aux-k8s-eqiad: add RBD-backed persistence [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) [13:41:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:43:05] (03PS2) 10CDanis: ceph: add aux-k8s-csi-rbd user [puppet] - 10https://gerrit.wikimedia.org/r/1118805 (https://phabricator.wikimedia.org/T380541) [13:43:06] (03CR) 10FNegri: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1118798 (https://phabricator.wikimedia.org/T379550) (owner: 10Andrew Bogott) [13:43:44] (03CR) 10Andrew Bogott: [C:03+2] Add alerting for cloud-vps users outside of the Bastion project [alerts] - 10https://gerrit.wikimedia.org/r/1118798 (https://phabricator.wikimedia.org/T379550) (owner: 10Andrew Bogott) [13:44:12] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] core-Permissions: drop redundant CampaignEvents right assignments (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116834 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [13:45:07] (03CR) 10Btullis: [C:03+1] ceph: add aux-k8s-csi-rbd user [puppet] - 10https://gerrit.wikimedia.org/r/1118805 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [13:45:45] (03CR) 10Brouberol: aux-k8s-eqiad: add RBD-backed persistence (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [13:46:13] (03CR) 10Btullis: aux-k8s-eqiad: add RBD-backed persistence (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [13:46:29] (03CR) 10Btullis: [C:03+1] ceph: add aux-k8s-csi-rbd user [labs/private] - 10https://gerrit.wikimedia.org/r/1118806 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [13:47:51] (03PS4) 10CDanis: aux-k8s-eqiad: add RBD-backed persistence [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) [13:47:57] (03CR) 10CDanis: aux-k8s-eqiad: add RBD-backed persistence (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 
(https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [13:48:25] (03CR) 10CDanis: aux-k8s-eqiad: add RBD-backed persistence (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [13:50:20] (03PS1) 10Phuedx: [Experiment Platform]: Disable experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118810 (https://phabricator.wikimedia.org/T373715) [13:50:48] (03PS1) 10Michael Große: refactor(AddLink): ignore rows with `null` in Store [extensions/GrowthExperiments] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1118811 (https://phabricator.wikimedia.org/T382270) [13:51:09] (03PS1) 10Brouberol: Airflow: upgrade to confluent-kafka 2.8.0 and specify path to CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118812 (https://phabricator.wikimedia.org/T386092) [13:51:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1118811 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [13:51:16] (03CR) 10Btullis: [C:03+1] aux-k8s-eqiad: add RBD-backed persistence [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [13:51:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118810 (https://phabricator.wikimedia.org/T373715) (owner: 10Phuedx) [13:53:29] (03Abandoned) 10Brouberol: envoy: add the analytics-web service to the mesh [puppet] - 10https://gerrit.wikimedia.org/r/1116760 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [13:55:46] (03CR) 10Santiago Faci: [C:03+1] 
[Experiment Platform]: Disable experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118810 (https://phabricator.wikimedia.org/T373715) (owner: 10Phuedx) [13:59:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:59:18] (03CR) 10CDanis: [V:03+2 C:03+2] ceph: add aux-k8s-csi-rbd user [labs/private] - 10https://gerrit.wikimedia.org/r/1118806 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T1400). [14:00:05] Daimona, MichaelG_WMF, and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] o/ [14:00:11] o/ [14:00:15] (03CR) 10CDanis: [C:03+2] ceph: add aux-k8s-csi-rbd user [puppet] - 10https://gerrit.wikimedia.org/r/1118805 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [14:00:16] i can deploy [14:00:22] (making coffee but am here) [14:01:44] o/ [14:01:50] o/ [14:02:15] (03CR) 10Urbanecm: [C:03+2] refactor(AddLink): ignore rows with `null` in Store [extensions/GrowthExperiments] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1118811 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [14:03:07] (03CR) 10Urbanecm: [C:03+2] Drop obsolete CampaignEvents config flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118604 (https://phabricator.wikimedia.org/T380076) (owner: 10Daimona Eaytoy) [14:03:11] (03CR) 10Btullis: [C:03+1] "Nice. Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1118812 (https://phabricator.wikimedia.org/T386092) (owner: 10Brouberol) [14:03:24] (03CR) 10Urbanecm: [C:03+2] core-Permissions: drop redundant CampaignEvents right assignments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116834 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [14:03:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10538915 (10Gehel) [14:04:02] (03Merged) 10jenkins-bot: Drop obsolete CampaignEvents config flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118604 (https://phabricator.wikimedia.org/T380076) (owner: 10Daimona Eaytoy) [14:04:19] (03Merged) 10jenkins-bot: core-Permissions: drop redundant CampaignEvents right assignments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116834 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [14:05:22] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1118604|Drop obsolete CampaignEvents config flags (T380076 T381423)]], [[gerrit:1116834|core-Permissions: drop redundant CampaignEvents right assignments (T376822)]] [14:05:28] T380076: Remove feature flag for event wikis - https://phabricator.wikimedia.org/T380076 [14:05:29] T381423: Remove feature flag for event topics - https://phabricator.wikimedia.org/T381423 [14:05:29] T376822: Configure the CampaignEvents extension to use the event-organizer group by default - https://phabricator.wikimedia.org/T376822 [14:09:46] 06SRE, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10538953 (10Gehel) [14:09:55] !log urbanecm@deploy2002 daimona, urbanecm: Backport for [[gerrit:1118604|Drop obsolete CampaignEvents config flags (T380076 T381423)]], [[gerrit:1116834|core-Permissions: drop redundant 
CampaignEvents right assignments (T376822)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:11:03] Daimona: please test if there's anything testable [14:11:28] (03CR) 10Btullis: [C:03+1] aux-k8s-eqiad: add RBD-backed persistence (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [14:11:34] Note: my change cannot be meaningfully tested beyond "no new alerts popping up anywhere". For now, it is effectively a no-op, and it is intended to serve as preparation for being able to deploy a more risky change to the next branch. [14:12:01] ack MichaelG_WMF [14:12:25] urbanecm: ty, looks good! [14:12:47] !log urbanecm@deploy2002 daimona, urbanecm: Continuing with sync [14:12:49] proceeding [14:15:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:16:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:19:10] (03PS2) 10Phuedx: [Experiment Platform]: Disable experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118810 (https://phabricator.wikimedia.org/T373715) [14:19:34] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118604|Drop obsolete CampaignEvents config flags (T380076 T381423)]], [[gerrit:1116834|core-Permissions: drop redundant CampaignEvents right assignments (T376822)]] (duration: 14m 11s) [14:19:39] T380076: Remove feature flag for event wikis - https://phabricator.wikimedia.org/T380076 [14:19:40] T381423: Remove feature flag for event topics - https://phabricator.wikimedia.org/T381423 [14:19:40] T376822: Configure the CampaignEvents extension to use the event-organizer group by default - https://phabricator.wikimedia.org/T376822 [14:20:45] (03CR) 10Urbanecm: [C:03+2] [Experiment Platform]: Disable experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118810 
(https://phabricator.wikimedia.org/T373715) (owner: 10Phuedx) [14:20:58] (03Merged) 10jenkins-bot: refactor(AddLink): ignore rows with `null` in Store [extensions/GrowthExperiments] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1118811 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [14:21:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118810 (https://phabricator.wikimedia.org/T373715) (owner: 10Phuedx) [14:21:26] (03CR) 10Ottomata: [C:03+2] eventgate-analytics - upgrade to v1.10.0 and NodeJS 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114798 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [14:21:39] (03Merged) 10jenkins-bot: [Experiment Platform]: Disable experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118810 (https://phabricator.wikimedia.org/T373715) (owner: 10Phuedx) [14:21:43] (03CR) 10CI reject: [V:04-1] eventgate-analytics - upgrade to v1.10.0 and NodeJS 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114798 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [14:22:02] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1118810|[Experiment Platform]: Disable experiments (T373715 T383801)]], [[gerrit:1118811|refactor(AddLink): ignore rows with `null` in Store (T382270)]] [14:22:10] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [14:22:10] T383801: Remove Experimentation Lab's first test experiment - https://phabricator.wikimedia.org/T383801 [14:22:11] T382270: Store the fact that Add Link did not generate any recommendation for a page, don't try again - https://phabricator.wikimedia.org/T382270 [14:23:30] (03PS2) 10Ottomata: eventgate-analytics - upgrade to v1.10.0 and NodeJS 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114798 
(https://phabricator.wikimedia.org/T383814) [14:24:59] !log urbanecm@deploy2002 phuedx, migr, urbanecm: Backport for [[gerrit:1118810|[Experiment Platform]: Disable experiments (T373715 T383801)]], [[gerrit:1118811|refactor(AddLink): ignore rows with `null` in Store (T382270)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:25:07] phuedx: please test at mwdebug [14:25:11] urbanecm: on it [14:25:16] MichaelG_WMF: fyi ^^ [14:25:26] (03CR) 10Ottomata: [V:03+2 C:03+2] eventgate-analytics - upgrade to v1.10.0 and NodeJS 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114798 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [14:26:44] urbanecm: LGTM. Experiment enrollment status not present on wikitechwiki but still present on testwiki [14:26:50] awesome [14:26:51] !log urbanecm@deploy2002 phuedx, migr, urbanecm: Continuing with sync [14:26:53] proceeding [14:27:44] @urbanecm thanks for letting me know, but there is nothing I can test on my end. The code is executed by a maintenance script and it is (currently) a no-op. [14:27:52] yep yep, makes sense [14:28:46] FYI i'm deploying eventgate-analytics to staging, will wait until window is finished to proceed with traffic serving clusters [14:28:57] ottomata: i'll ping you when done [14:29:00] ty [14:29:06] np [14:29:26] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [14:29:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [14:29:56] urbanecm: actually I messed up with one of my patches. Would there be enough time for a follow-up change? [14:30:06] Context: I accidentally removed the explicit overrides for testwiki and test2wiki [14:31:04] Daimona: sure, can you upload it? 
[14:31:15] Yep I'm writing it [14:31:16] ty [14:33:05] (03PS2) 10Brouberol: Airflow: upgrade to confluent-kafka 2.8.0 and specify path to CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118812 (https://phabricator.wikimedia.org/T386092) [14:33:39] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118810|[Experiment Platform]: Disable experiments (T373715 T383801)]], [[gerrit:1118811|refactor(AddLink): ignore rows with `null` in Store (T382270)]] (duration: 11m 36s) [14:33:45] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [14:33:45] T383801: Remove Experimentation Lab's first test experiment - https://phabricator.wikimedia.org/T383801 [14:33:46] T382270: Store the fact that Add Link did not generate any recommendation for a page, don't try again - https://phabricator.wikimedia.org/T382270 [14:33:48] Daimona: waiting on your change now [14:34:15] (03PS1) 10Daimona Eaytoy: test(2)wiki: Re-assign event organizer rights to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118820 (https://phabricator.wikimedia.org/T376822) [14:34:23] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1118820 [14:34:36] (03CR) 10Urbanecm: [C:03+2] test(2)wiki: Re-assign event organizer rights to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118820 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [14:34:41] let's go ahead here! 
[14:34:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118820 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [14:35:31] (03Merged) 10jenkins-bot: test(2)wiki: Re-assign event organizer rights to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118820 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [14:35:31] It's the exact same change I made a few weeks ago, and that I then accidentally reverted. I was too excited dropping code and didn't notice I was dropping too much. [14:35:49] i saw it, but i thought it was intentional :D [14:35:59] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1118820|test(2)wiki: Re-assign event organizer rights to all users (T376822)]] [14:36:01] (03PS1) 10Gmodena: cirrus: deploy new mlr models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118782 (https://phabricator.wikimedia.org/T385972) [14:36:01] (03CR) 10Gmodena: "Thanks for the pointers re enabling wikis!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118782 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [14:36:02] T376822: Configure the CampaignEvents extension to use the event-organizer group by default - https://phabricator.wikimedia.org/T376822 [14:36:09] (03PS2) 10Gmodena: cirrus: deploy new mlr models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118782 (https://phabricator.wikimedia.org/T385972) [14:36:53] (03CR) 10CI reject: [V:04-1] cirrus: deploy new mlr models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118782 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [14:37:35] Hm, yes, of course it was intentional. It was a test to see how thoroughly I would be testing my previous change. 
[14:37:40] hehe [14:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:25] (03CR) 10Gmodena: "mmm... gerrit does not show any suggestion/edit. What would you like to change?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [14:39:43] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [14:40:39] !log urbanecm@deploy2002 urbanecm, daimona: Backport for [[gerrit:1118820|test(2)wiki: Re-assign event organizer rights to all users (T376822)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:40:41] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [14:40:45] Daimona: can you test? [14:41:06] Yup, looks good now [14:41:23] !log urbanecm@deploy2002 urbanecm, daimona: Continuing with sync [14:41:26] proceeding [14:41:32] Great, thanks! [14:48:07] (03CR) 10Jforrester: Remove temporary '-k8s' suffix from ArcLamp pipeline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118208 (owner: 10Ori) [14:48:19] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118820|test(2)wiki: Re-assign event organizer rights to all users (T376822)]] (duration: 12m 20s) [14:48:22] T376822: Configure the CampaignEvents extension to use the event-organizer group by default - https://phabricator.wikimedia.org/T376822 [14:48:25] Daimona: synced [14:48:27] anything else? 
[14:48:56] Nice, thanks again :) [14:49:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [14:49:42] James_F: 🤦 [14:49:44] thanks [14:50:07] (03CR) 10Ori: Remove temporary '-k8s' suffix from ArcLamp pipeline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118208 (owner: 10Ori) [14:50:57] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [14:52:35] (03PS3) 10Gmodena: cirrus: deploy new mlr models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118782 (https://phabricator.wikimedia.org/T385972) [14:54:47] (03CR) 10JMeybohm: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:58:39] (03PS1) 10Muehlenhoff: Add ebomani [puppet] - 10https://gerrit.wikimedia.org/r/1118823 [14:59:52] (03CR) 10Muehlenhoff: [C:03+2] Add ebomani [puppet] - 10https://gerrit.wikimedia.org/r/1118823 (owner: 10Muehlenhoff) [15:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:20] (03CR) 10Ottomata: "Suggestion: use the full cluster name in the connection name e.g. kafka_jumbo_eqiad. 'jumbo-eqiad' and 'test-eqiad' are the actual kafka " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [15:18:23] (03PS3) 10Brouberol: airflow: add kafka connections to configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) [15:18:36] (03CR) 10Brouberol: "Good call!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [15:25:19] 06SRE, 10MediaWiki-Search: Italian Wikivoyage search feature doesn't work with non existing pages - https://phabricator.wikimedia.org/T140982#10539402 (10Gehel) [15:25:38] 06SRE, 10CirrusSearch, 10Elasticsearch, 13Patch-For-Review, 07Wikimedia-production-error: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784#10539405 (10Gehel) [15:31:00] 06SRE-OnFire, 10Sustainability (Incident Followup): Better test environments for Elastic - https://phabricator.wikimedia.org/T317420#10539484 (10Gehel) [15:31:20] 06SRE-OnFire, 10Sustainability (Incident Followup): Replace certificate on deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T315386#10539491 (10Gehel) [15:31:24] 10ops-codfw, 06SRE, 06DC-Ops: elastic2054 is down with memory error - https://phabricator.wikimedia.org/T315989#10539489 (10Gehel) [15:34:45] (03PS4) 10Brouberol: airflow: add kafka connections to configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) [15:36:25] 06SRE, 06Traffic: SSL cert renewal warnings for cloudelastic100[5-6].wikimedia.org - https://phabricator.wikimedia.org/T261528#10539558 (10Gehel) [15:36:57] 06SRE, 10vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189#10539568 (10Gehel) [15:37:58] 06SRE: commonswiki_content shards > 50GB - need resharding - https://phabricator.wikimedia.org/T246986#10539585 (10Gehel) [15:39:01] (03CR) 10CI reject: [V:04-1] airflow: add kafka connections to configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [15:41:09] 06SRE, 10Wikidata, 10Wikidata-Query-Service: Blazegraph and updater failed on wdqs1009 - 
https://phabricator.wikimedia.org/T219052#10539628 (10Gehel) [15:42:22] 06SRE: Additional network ports for elasticsearch servers? - https://phabricator.wikimedia.org/T189854#10539653 (10Gehel) [15:43:58] (03PS5) 10Brouberol: airflow: add kafka connections to configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) [15:44:06] 06SRE, 10Elasticsearch, 13Patch-For-Review: performance regression after upgrading elasticsearch to v5.3.2 - https://phabricator.wikimedia.org/T167685#10539673 (10Gehel) [15:45:38] 06SRE, 10Elasticsearch: elastic2028 fails to reimage - root device not found - https://phabricator.wikimedia.org/T157819#10539704 (10Gehel) [15:47:40] 10ops-eqiad, 06SRE, 06DC-Ops: elastic1027 does not reboot - https://phabricator.wikimedia.org/T146268#10539745 (10Gehel) [15:49:47] (03CR) 10Btullis: [C:03+1] airflow: add kafka connections to configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [15:52:56] (03CR) 10Ottomata: [C:03+1] airflow: add kafka connections to configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [15:55:48] (03CR) 10DCausse: [C:03+1] "lgtm!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118782 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [15:55:59] 06SRE, 10Elasticsearch, 10Wikidata, 07Wikimedia-Incident: wbsgetsuggestions not returning anything on Wikidata - https://phabricator.wikimedia.org/T326590#10539910 (10Gehel) [15:56:35] 06SRE, 10Elasticsearch: elastic200[456] suddenly offlined - https://phabricator.wikimedia.org/T204772#10539924 (10Gehel) [15:57:27] 06SRE, 10Elasticsearch, 10Phabricator, 06Release-Engineering-Team: Setup a private elasticsearch cluster for phabricator - https://phabricator.wikimedia.org/T156939#10539938 (10Gehel) [15:57:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Elasticsearch, 13Patch-For-Review: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#10539939 (10Gehel) [15:57:42] !log jforrester@deploy2002 Started deploy [integration/docroot@55d716f]: List Message library and bump MW-CS to v46 [15:57:52] !log jforrester@deploy2002 Finished deploy [integration/docroot@55d716f]: List Message library and bump MW-CS to v46 (duration: 00m 09s) [15:58:08] 06SRE, 10Elasticsearch, 07Security: wait longer in es-tool before enabling replication - https://phabricator.wikimedia.org/T99500#10539949 (10Gehel) [15:58:12] 06SRE, 10CirrusSearch, 10Elasticsearch, 13Patch-For-Review, 07User-notice-archive: Search gives an errormessage instead of search results - https://phabricator.wikimedia.org/T102463#10539948 (10Gehel) [15:58:16] 06SRE, 10Elasticsearch, 07Security: Upgrade CirrusSearch's Elasticsearch cluster to 1.3.8+ - https://phabricator.wikimedia.org/T92853#10539952 (10Gehel) [15:58:20] 06SRE, 10Elasticsearch, 07Security: Upgrade all ElasticSearch clusters to 1.3.8+ (RCE vulnerability) - https://phabricator.wikimedia.org/T92770#10539953 (10Gehel) [15:58:24] 06SRE, 10Elasticsearch, 07Security: Update Logstash Elasticsearch to 1.3.8+ - https://phabricator.wikimedia.org/T92854#10539951 (10Gehel) [16:00:04] jelto, arnoldokoth, and mutante: I seem 
to be stuck in Groundhog week. Sigh. Time for (yet another) SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T1600). [16:00:17] 10SRE-tools, 10Elasticsearch, 06Infrastructure-Foundations, 10Spicerack: elasticsearch spicerack module failes with most recent elastic-curator - https://phabricator.wikimedia.org/T328775#10539986 (10Gehel) [16:00:35] 10ops-codfw, 06SRE, 06DC-Ops: elastic2043 reported memory errors - https://phabricator.wikimedia.org/T321771#10539987 (10Gehel) [16:01:35] 06SRE, 10SRE-Access-Requests: Allow analytics-search-users members to sudo as the airflow user - https://phabricator.wikimedia.org/T238905#10540015 (10Gehel) [16:01:53] 06SRE, 10Elasticsearch, 10Maps: Elasticsearch metrics to be added to prometheus exporter - https://phabricator.wikimedia.org/T210523#10540021 (10Gehel) [16:02:03] 06SRE, 10SRE-Access-Requests, 10Elasticsearch, 13Patch-For-Review: add onimisionipe to restricted group - https://phabricator.wikimedia.org/T204980#10540022 (10Gehel) [16:02:52] (03CR) 10Brouberol: [C:03+2] airflow: add kafka connections to configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118784 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [16:03:46] 10ops-codfw, 06SRE, 06DC-Ops, 10Elasticsearch, and 2 others: codfw: elastic2025-elastic2036/switch port configuration - https://phabricator.wikimedia.org/T154605#10540043 (10Gehel) [16:04:15] 06SRE, 10Elasticsearch, 07Wikimedia-production-error: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#10540049 (10Gehel) [16:04:27] 06SRE, 10Elasticsearch, 10observability, 10Wikimedia-Logstash, 13Patch-For-Review: Disable cron job to clear elasticsearch caches and validate that it does not have significant impact on GC - https://phabricator.wikimedia.org/T144396#10540051 (10Gehel) [16:04:31] 06SRE, 10Elasticsearch, 10observability, 10Wikimedia-Logstash: 
logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#10540054 (10Gehel) [16:05:01] 06SRE, 10Dumps-Generation, 10Elasticsearch, 13Patch-For-Review: Link "current" to last dump set on cirrussearch get a 404 - https://phabricator.wikimedia.org/T138176#10540060 (10Gehel) [16:05:13] 06SRE, 10Elasticsearch: high load on elastic1001 - https://phabricator.wikimedia.org/T135509#10540063 (10Gehel) [16:05:38] 06SRE, 10CirrusSearch, 10Elasticsearch, 13Patch-For-Review: Enable metric collection on nginx for elasticsearch - https://phabricator.wikimedia.org/T130365#10540070 (10Gehel) [16:05:42] 06SRE, 10CirrusSearch, 10Elasticsearch, 13Patch-For-Review: Should we have a specific check for SSL certificate expiration on elasticsearch - https://phabricator.wikimedia.org/T130366#10540069 (10Gehel) [16:05:46] 06SRE, 10Elasticsearch, 10observability, 10Wikimedia-Logstash, 13Patch-For-Review: logstash - nginx failed service start - https://phabricator.wikimedia.org/T129934#10540071 (10Gehel) [16:06:41] jouncebot now [16:06:41] For the next 0 hour(s) and 53 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T1600) [16:06:52] 06SRE, 10Elasticsearch: Investigate mysterious write load during general read-only maintenance - https://phabricator.wikimedia.org/T109127#10540086 (10Gehel) [16:06:57] !log dancy@deploy2002 Installing scap version "4.136.0" for 204 host(s) [16:07:57] 06SRE, 10Elasticsearch: Investigate tweaking of the "wait for me" parameter for upgrades / restarts - https://phabricator.wikimedia.org/T109091#10540106 (10Gehel) [16:08:25] !log krinkle@deploy2002 Started deploy [statsv/statsv@7b87958]: T383953: Enable intake of dogstatsd timer messages [16:08:28] T383953: Statsv support for timer metrics - https://phabricator.wikimedia.org/T383953 [16:08:31] 06SRE, 10Elasticsearch, 10observability, 10Wikimedia-Logstash: Update Wikimedia apt repo to include debs for Elasticsearch 
& Logstash on jessie - https://phabricator.wikimedia.org/T98042#10540120 (10Gehel) [16:08:34] !log krinkle@deploy2002 Finished deploy [statsv/statsv@7b87958]: T383953: Enable intake of dogstatsd timer messages (duration: 00m 08s) [16:09:32] 06SRE, 10Elasticsearch, 10observability, 10Wikimedia-Logstash, 07Grafana: Deploy statsd plugin for production elasticsearch & logstash - https://phabricator.wikimedia.org/T90889#10540133 (10Gehel) [16:11:41] !log dancy@deploy2002 Installation of scap version "4.136.0" completed for 204 hosts [16:13:36] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission elastic2035.codfw.wmnet - https://phabricator.wikimedia.org/T300946#10540196 (10Gehel) [16:13:54] 14ops-core, 07Puppet, 06SRE, 10SRE-swift-storage, and 12 others: Project Proposal: Label style projects for common operations tools - https://phabricator.wikimedia.org/T1147#10540193 (10Gehel) [16:14:29] 06SRE, 10Elasticsearch, 07Epic: Collect threaddumps from elasticsearch at regular intervals - https://phabricator.wikimedia.org/T130209#10540211 (10Gehel) [16:22:12] (03PS1) 10Brouberol: airflow: add kafka-{test,jumbo}-eqiad connections to the remaining instances [puppet] - 10https://gerrit.wikimedia.org/r/1118831 (https://phabricator.wikimedia.org/T379676) [16:22:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 2 others: wdqs1025 fails to PXE boot, NIC shows "no link" in DRAC web UI - https://phabricator.wikimedia.org/T381283#10540334 (10Gehel) [16:22:33] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10540336 (10Gehel) [16:22:43] 10ops-codfw, 06SRE, 06DC-Ops: Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10540344 (10Gehel) [16:22:53] 10ops-codfw, 06SRE, 06DC-Ops: Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10540347 (10Gehel) [16:23:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 2 others: 
Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10540342 (10Gehel) [16:25:08] 06SRE-OnFire, 10Observability-Alerting, 10Sustainability (Incident Followup): Improve Search team alerting for missing masters - https://phabricator.wikimedia.org/T313095#10540397 (10Gehel) [16:25:12] 06SRE, 10Elasticsearch: Reindex commonswiki as shards have grown beyond critical threshold - https://phabricator.wikimedia.org/T231446#10540409 (10Gehel) [16:25:16] 06SRE, 10Elasticsearch: Port elasticsearch support scripts to cookbooks - https://phabricator.wikimedia.org/T269218#10540411 (10Gehel) [16:25:20] 06SRE, 10CirrusSearch: re-enable deprecation warning logger on elasticsearch once issues are solved - https://phabricator.wikimedia.org/T218995#10540415 (10Gehel) [16:25:32] 06SRE, 10Elasticsearch, 10observability, 06SRE Observability, 10Wikimedia-Logstash: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125#10540417 (10Gehel) [16:25:52] 06SRE: Also use java::security on elasticsearch/relforge - https://phabricator.wikimedia.org/T251540#10540429 (10Gehel) [16:26:00] 06SRE, 10vm-requests: setup/install an-airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181#10540435 (10Gehel) [16:26:21] 06SRE: Migrate Elasticsearch to Debian Buster - https://phabricator.wikimedia.org/T244736#10540437 (10Gehel) [16:26:25] 06SRE, 10Elasticsearch, 07good first task: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#10540431 (10Gehel) [16:26:55] 06SRE, 10Elasticsearch, 06Traffic: Enable nginx prometheus metrics for all elastic nodes - https://phabricator.wikimedia.org/T216681#10540461 (10Gehel) [16:27:09] 06SRE, 10Elasticsearch, 10Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812#10540459 (10Gehel) [16:27:13] 06SRE, 10Elasticsearch, 10Icinga, 10observability, 10Wikimedia-Logstash: Remove elasticsearch 
icinga checks from logstash collectors - https://phabricator.wikimedia.org/T218691#10540463 (10Gehel) [16:27:21] (03CR) 10Bking: [C:03+2] Update plugins to opensearch 1.3.20 [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1118553 (https://phabricator.wikimedia.org/T385005) (owner: 10DCausse) [16:27:25] (03CR) 10Bking: [V:03+2 C:03+2] Update plugins to opensearch 1.3.20 [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1118553 (https://phabricator.wikimedia.org/T385005) (owner: 10DCausse) [16:27:54] (03PS2) 10Brouberol: airflow: add kafka-{test,jumbo}-eqiad connections to the remaining instances [puppet] - 10https://gerrit.wikimedia.org/r/1118831 (https://phabricator.wikimedia.org/T379676) [16:28:16] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4935/co" [puppet] - 10https://gerrit.wikimedia.org/r/1118831 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [16:38:51] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10540742 (10phaultfinder) [16:44:12] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@ffd0a45]: bump image suggestions to v0.25.0 [16:44:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10540871 (10phaultfinder) [16:44:53] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@ffd0a45]: bump image suggestions to v0.25.0 (duration: 01m 03s) [16:47:49] (03PS3) 10Brouberol: Airflow: upgrade to confluent-kafka 2.8.0 and specify path to CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118812 (https://phabricator.wikimedia.org/T386092) [16:48:40] (03PS4) 10Brouberol: Airflow: upgrade to confluent-kafka 2.8.0 and specify path to CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118812 (https://phabricator.wikimedia.org/T386092) 
[16:48:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118782 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [16:56:56] (03PS6) 10Bernard Wang: Turn on sampling rate for web ab test schemas for basque and ca wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 [16:57:41] (03CR) 10CI reject: [V:04-1] Turn on sampling rate for web ab test schemas for basque and ca wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 (owner: 10Bernard Wang) [16:57:49] (03CR) 10DCausse: "oops seems like I commented on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1118783/1/wmf-config/ext-CirrusSearch.php bu" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [16:58:43] (03CR) 10DCausse: cirrus: create buckets for mlr 2025 experiment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118783 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [16:58:53] (03PS7) 10Bernard Wang: Turn on sampling rate for web ab test schemas for basque and ca wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 [17:01:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118783 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [17:02:26] (03CR) 10Jdlrobson: [C:03+1] Turn on sampling rate for web ab test schemas for basque and ca wiki (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 (owner: 10Bernard Wang) [17:07:31] (03PS8) 10Bernard Wang: Deploy web ab test to eu and ca wiki for an initial test [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1118566 (https://phabricator.wikimedia.org/T384019) [17:07:50] (03PS2) 10Gmodena: cirrus: create buckets for mlr 2025 experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118783 (https://phabricator.wikimedia.org/T385972) [17:08:37] (03CR) 10Gmodena: cirrus: create buckets for mlr 2025 experiment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118783 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [17:10:30] (03PS3) 10Gmodena: cirrus: create buckets for mlr 2025 experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118783 (https://phabricator.wikimedia.org/T385972) [17:12:23] (03PS4) 10Gmodena: cirrus: enable mlr-2025 for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) [17:14:11] (03PS5) 10Gmodena: cirrus: enable mlr-2025 for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) [17:19:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10541027 (10phaultfinder) [17:35:18] (03PS1) 10Jdlrobson: Fixes: Selecting search results on mobile website in Firefox does not work [extensions/MobileFrontend] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1118841 (https://phabricator.wikimedia.org/T381289) [17:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:42] (03CR) 10Brouberol: [C:03+2] Airflow: upgrade to confluent-kafka 2.8.0 and specify path to CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118812 (https://phabricator.wikimedia.org/T386092) (owner: 10Brouberol) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T1800) [18:13:24] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10541245 (10Andrew) 05Open→03Resolved These moves are done. [18:14:10] (03PS2) 10Dbrant: Add app_game_interaction event stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118577 (https://phabricator.wikimedia.org/T385323) [18:15:26] (03PS1) 10Btullis: dse-k8s: Use partman recipes for containerd with local storage support [puppet] - 10https://gerrit.wikimedia.org/r/1118844 (https://phabricator.wikimedia.org/T377875) [18:16:15] (03PS2) 10Btullis: dse-k8s: Use partman recipes for containerd with local storage support [puppet] - 10https://gerrit.wikimedia.org/r/1118844 (https://phabricator.wikimedia.org/T377875) [18:16:55] (03PS3) 10Dbrant: Add app_game_interaction event stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118577 (https://phabricator.wikimedia.org/T385323) [18:17:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118577 (https://phabricator.wikimedia.org/T385323) (owner: 10Dbrant) [18:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10541282 (10phaultfinder) [18:31:00] (03PS1) 10Btullis: dse-k8s: Stop installing the amd rocm packages to dse-k8s-worker1001 [puppet] - 10https://gerrit.wikimedia.org/r/1118846 (https://phabricator.wikimedia.org/T377875) [18:31:27] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118846 (https://phabricator.wikimedia.org/T377875) (owner: 10Btullis) [18:32:22] (03CR) 10Pppery: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118497 
(https://phabricator.wikimedia.org/T385960) (owner: 10Dragoniez) [18:34:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [18:37:46] (03PS2) 10TChin: Eventstreams: Bump image, use service-utils [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111105 (https://phabricator.wikimedia.org/T361769) [18:47:28] (03PS6) 10Pppery: Fix some wrong descriptions of old dumps [puppet] - 10https://gerrit.wikimedia.org/r/1112123 [18:47:34] 06SRE, 06Data-Engineering, 10Dumps 2.0, 10Dumps-Generation, 07Epic: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10541350 (10Ottomata) [18:48:23] (03PS7) 10Pppery: Fix some wrong descriptions of old dumps [puppet] - 10https://gerrit.wikimedia.org/r/1112123 [18:48:44] (03CR) 10CI reject: [V:04-1] Fix some wrong descriptions of old dumps [puppet] - 10https://gerrit.wikimedia.org/r/1112123 (owner: 10Pppery) [18:49:22] (03CR) 10Pppery: "Noticed another thing to fix, also adding someone who has +2 in this repo and has merged similar dumps-related patches to hopefully get th" [puppet] - 10https://gerrit.wikimedia.org/r/1112123 (owner: 10Pppery) [18:50:04] (03PS8) 10Pppery: Fix some wrong descriptions of old dumps [puppet] - 10https://gerrit.wikimedia.org/r/1112123 [18:54:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [18:58:21] (03PS9) 10Bernard Wang: Deploy web ab test to eu and ca wiki for an initial test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 (https://phabricator.wikimedia.org/T384019) [18:59:38] (03CR) 10Bernard Wang: [C:03+1] "I updated the bucketing to 99/1 and sample rate to 50 based off a convo with jennifer" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 
(https://phabricator.wikimedia.org/T384019) (owner: 10Bernard Wang) [19:09:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10541458 (10phaultfinder) [19:29:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 426691416 and 50 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:31:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 99096 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:34:32] (03CR) 10Jdrewniak: [C:03+1] Fixes: Selecting search results on mobile website in Firefox does not work [extensions/MobileFrontend] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1118841 (https://phabricator.wikimedia.org/T381289) (owner: 10Jdlrobson) [19:43:20] (03CR) 10Stoyofuku-wmf: ".99 + .01 + .01 = 1.01" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 (https://phabricator.wikimedia.org/T384019) (owner: 10Bernard Wang) [19:53:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 106927504 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:54:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 24544 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:22:50] (03CR) 10Ottomata: [C:03+1] Eventstreams: Bump image, use service-utils [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111105 (https://phabricator.wikimedia.org/T361769) (owner: 10TChin) [20:57:25] (03CR) 10Cathal Mooney: [C:03+2] Add FIDO2-based ssh keys for user cmooney [puppet] - 10https://gerrit.wikimedia.org/r/1115495 (https://phabricator.wikimedia.org/T385229) (owner: 10Cathal Mooney) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and 
kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T2100). [21:00:04] dbrant: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:02:22] whoops, o/ here [21:17:18] (03PS10) 10Bernard Wang: Deploy web ab test to eu and ca wiki for an initial test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 (https://phabricator.wikimedia.org/T384019) [21:20:36] o/ deployers about? [21:22:50] dbrant: I can help out [21:23:08] cool, it's just one config change [21:23:31] OK. Do you have a way to test it after it reaches testservers? [21:24:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118577 (https://phabricator.wikimedia.org/T385323) (owner: 10Dbrant) [21:24:07] yep, believe so [21:24:15] Awesome. Started off the process. [21:24:45] (03Merged) 10jenkins-bot: Add app_game_interaction event stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118577 (https://phabricator.wikimedia.org/T385323) (owner: 10Dbrant) [21:25:15] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1118577|Add app_game_interaction event stream. (T385323)]] [21:25:18] T385323: Create new data stream for experiment data `app_game_interaction` - https://phabricator.wikimedia.org/T385323 [21:28:16] !log dancy@deploy2002 dbrant, dancy: Backport for [[gerrit:1118577|Add app_game_interaction event stream. (T385323)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:29:08] yep, seeing it! [21:29:36] Excellent. 
Proceeding [21:29:38] !log dancy@deploy2002 dbrant, dancy: Continuing with sync [21:36:19] (03PS14) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [21:36:20] !log dancy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118577|Add app_game_interaction event stream. (T385323)]] (duration: 11m 04s) [21:36:23] T385323: Create new data stream for experiment data `app_game_interaction` - https://phabricator.wikimedia.org/T385323 [21:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:43:57] (03CR) 10Jdlrobson: [C:03+1] "👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 (https://phabricator.wikimedia.org/T384019) (owner: 10Bernard Wang) [21:55:45] (03PS1) 10Hokwelum: ResourceLoader: remove module_deps from tables-catalog [puppet] - 10https://gerrit.wikimedia.org/r/1118872 (https://phabricator.wikimedia.org/T379661) [21:57:52] (03PS1) 10Jdlrobson: Fixes: Selecting search results on mobile website in Firefox does not work [extensions/MobileFrontend] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1118873 (https://phabricator.wikimedia.org/T381289) [21:58:07] @jan_drewniak around are we deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/1118841?usp=search now? We also need to backport that one ^ [21:58:19] (or just that one and let it roll out with the current train) [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250211T2200) [22:01:47] dancy: is the backport window still going? 
[22:02:49] jan_drewniak: It officially ended a minute ago but I can help you out if needed [22:03:28] dancy: ok great, Web team has a dedicated deploy window right after, so just wanted to check if you're done :) [22:04:01] ah, gotcha. Yeah, have at it! [22:06:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1118873 (https://phabricator.wikimedia.org/T381289) (owner: 10Jdlrobson) [22:06:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1118841 (https://phabricator.wikimedia.org/T381289) (owner: 10Jdlrobson) [22:06:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1118558 (https://phabricator.wikimedia.org/T383936) (owner: 10Jdlrobson) [22:16:03] (03Merged) 10jenkins-bot: Fixes: Selecting search results on mobile website in Firefox does not work [extensions/MobileFrontend] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1118873 (https://phabricator.wikimedia.org/T381289) (owner: 10Jdlrobson) [22:16:15] (03Merged) 10jenkins-bot: Fixes: Selecting search results on mobile website in Firefox does not work [extensions/MobileFrontend] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1118841 (https://phabricator.wikimedia.org/T381289) (owner: 10Jdlrobson) [22:16:16] (03Merged) 10jenkins-bot: Add search activity id [extensions/WikimediaEvents] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1118558 (https://phabricator.wikimedia.org/T383936) (owner: 10Jdlrobson) [22:16:49] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1118873|Fixes: Selecting search results on mobile website in Firefox does not work (T381289)]], [[gerrit:1118841|Fixes: Selecting search 
results on mobile website in Firefox does not work (T381289)]], [[gerrit:1118558|Add search activity id (T383936)]] [22:16:54] T381289: Selecting search results on mobile website in Firefox does not work - https://phabricator.wikimedia.org/T381289 [22:16:54] T383936: Implement search activity id for Empty Search A/B test click tracking - https://phabricator.wikimedia.org/T383936 [22:19:52] !log jdrewniak@deploy2002 jdlrobson, jdrewniak: Backport for [[gerrit:1118873|Fixes: Selecting search results on mobile website in Firefox does not work (T381289)]], [[gerrit:1118841|Fixes: Selecting search results on mobile website in Firefox does not work (T381289)]], [[gerrit:1118558|Add search activity id (T383936)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:20:12] (03CR) 10TChin: [C:03+2] Eventstreams: Bump image, use service-utils [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111105 (https://phabricator.wikimedia.org/T361769) (owner: 10TChin) [22:21:38] (03Merged) 10jenkins-bot: Eventstreams: Bump image, use service-utils [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111105 (https://phabricator.wikimedia.org/T361769) (owner: 10TChin) [22:23:24] !log jdrewniak@deploy2002 jdlrobson, jdrewniak: Continuing with sync [22:28:21] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:28:35] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:30:04] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118873|Fixes: Selecting search results on mobile website in Firefox does not work (T381289)]], [[gerrit:1118841|Fixes: Selecting search results on mobile website in Firefox does not work (T381289)]], [[gerrit:1118558|Add search activity id (T383936)]] (duration: 13m 14s) [22:30:08] T381289: Selecting search 
results on mobile website in Firefox does not work - https://phabricator.wikimedia.org/T381289 [22:30:08] T383936: Implement search activity id for Empty Search A/B test click tracking - https://phabricator.wikimedia.org/T383936 [22:30:25] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:32:11] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:32:15] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:32:25] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:32:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 (https://phabricator.wikimedia.org/T384019) (owner: 10Bernard Wang) [22:33:30] (03Merged) 10jenkins-bot: Deploy web ab test to eu and ca wiki for an initial test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118566 (https://phabricator.wikimedia.org/T384019) (owner: 10Bernard Wang) [22:33:57] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1118566|Deploy web ab test to eu and ca wiki for an initial test (T384019)]] [22:34:00] T384019: Deploy Empty Search A/B test - https://phabricator.wikimedia.org/T384019 [22:36:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.032s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:36:22] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10542125 (10Jhancock.wm) [22:36:52] !log jdrewniak@deploy2002 jdrewniak, bwang: Backport for [[gerrit:1118566|Deploy web ab test to eu and ca wiki for an initial test (T384019)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:39:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [22:40:57] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10542130 (10Jhancock.wm) [22:41:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.032s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:41:20] !log jdrewniak@deploy2002 jdrewniak, bwang: Continuing with sync [22:48:02] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118566|Deploy web ab test to eu and ca wiki for an initial test (T384019)]] (duration: 14m 05s) [22:48:06] T384019: Deploy Empty Search A/B test - https://phabricator.wikimedia.org/T384019 [22:59:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert