[00:03:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:04:03] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1047610 (owner: TrainBranchBot)
[00:16:37] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2024-06-11 00:00:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:19:35] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2024-06-11 00:00:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:22:35] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2024-06-11 00:00:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:25:37] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2024-06-11 00:00:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:25:01] ops-eqdfw, SRE, DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9908152 (Papaul) Your replacement part associated with RMA R200519348 Item # 100 has been successfully shipped. Details of which are provided below. Tracking URL: https://wwwapps.ups.com/We...
[01:37:09] RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 80%, RTA = 30.32 ms
[01:38:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T367856)', diff saved to https://phabricator.wikimedia.org/P65212 and previous config saved to /var/cache/conftool/dbconfig/20240620-013827-marostegui.json
[01:38:33] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:38:55] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[16:30:40] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1052.eqiad.wmnet with OS bookworm
[16:33:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 75%: post T365987 repool', diff saved to https://phabricator.wikimedia.org/P65256 and previous config saved to /var/cache/conftool/dbconfig/20240620-163348-arnaudb.json
[16:33:57] T365987: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987
[16:34:22] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bookworm
[16:36:57] SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810#9910823 (herron) >>! In T253810#9485013, @fgiunchedi wrote: > `ipmi_exporter` now has support to collect generic SEL entries and export metrics from those...
[16:37:21] (CR) EoghanGaffney: [V:+1 C:+2] lists: Untaint exim domain variable [puppet] - https://gerrit.wikimedia.org/r/1048027 (https://phabricator.wikimedia.org/T368063) (owner: EoghanGaffney)
[16:37:36] (PS73) AOkoth: prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[16:43:14] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368065#9910861 (VRiley-WMF) a:VRiley-WMF
[16:43:56] (PS74) AOkoth: prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[16:44:26] (CR) CI reject: [V:-1] prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[16:48:26] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage
[16:48:47] FIRING: ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1013-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:48:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 100%: post T365987 repool', diff saved to https://phabricator.wikimedia.org/P65257 and previous config saved to /var/cache/conftool/dbconfig/20240620-164857-arnaudb.json
[16:50:48] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:07] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage
[16:53:16] ops-eqiad, SRE, DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9910877 (RKemper) Host has been downtimed. Accidentally associated to wrong ticket: https://phabricator.wikimedia.org/T367825#9908323
[17:00:05] bd808: #bothumor I ♥ Unicode. All rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T1700).
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T1700)
[17:07:48] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2024.06.17 - 2024.07.07): Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9910948 (bking) Thanks @Jhancock.wm ! If you or anyone else reading this is interested in alerting based on SEL errors, I created T367790 , feel free to add yo...
[17:08:55] (PS1) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:09:19] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:14:07] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[17:15:23] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[17:15:26] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1063.eqiad.wmnet with OS bookworm
[17:17:24] (PS1) Andrew Bogott: hieradata: Move cloudvirt1053 to OVS [puppet] - https://gerrit.wikimedia.org/r/1048041 (https://phabricator.wikimedia.org/T364457)
[17:19:25] (PS2) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:19:48] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:20:53] (PS3) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:22:30] (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3002/console" [puppet] - https://gerrit.wikimedia.org/r/1048024 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[17:24:07] (CR) BCornwall: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3003/console" [puppet] - https://gerrit.wikimedia.org/r/1048024 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[17:24:50] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:25:02] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3004/co" [puppet] - https://gerrit.wikimedia.org/r/1048024 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[17:25:54] (CR) BCornwall: [V:+1 C:+1] haproxy:cache: raise max uri len to 2048B [puppet] - https://gerrit.wikimedia.org/r/1048024 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[17:26:22] (PS4) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:29:15] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 24.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:29:31] (PS5) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:30:20] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1053.eqiad.wmnet with OS bookworm
[17:30:40] (CR) Andrew Bogott: [C:+2] hieradata: Move cloudvirt1053 to OVS [puppet] - https://gerrit.wikimedia.org/r/1048041 (https://phabricator.wikimedia.org/T364457) (owner: Andrew Bogott)
[17:30:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T364069)', diff saved to https://phabricator.wikimedia.org/P65258 and previous config saved to /var/cache/conftool/dbconfig/20240620-173050-marostegui.json
[17:30:56] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[17:33:21] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:37:47] RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 60%, RTA = 0.41 ms
[17:38:26] ops-eqiad, cloud-services-team, DC-Ops: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093 (Andrew) NEW
[17:40:37] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:12] (PS6) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:41:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[17:41:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[17:41:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T352010)', diff saved to https://phabricator.wikimedia.org/P65259 and previous config saved to /var/cache/conftool/dbconfig/20240620-174125-ladsgroup.json
[17:41:31] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:43:36] (PS3) Ladsgroup: prometheus: Change footer icon ping url [puppet] - https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190)
[17:43:40] (CR) Ladsgroup: [V:+2 C:+2] prometheus: Change footer icon ping url [puppet] - https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190) (owner: Ladsgroup)
[17:44:02] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic2088.codfw.wmnet
[17:44:40] (CR) Fabfur: [C:+2] haproxy:cache: raise max uri len to 2048B [puppet] - https://gerrit.wikimedia.org/r/1048024 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[17:44:53] RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 90%, RTA = 33.83 ms
[17:44:56] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:45:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P65260 and previous config saved to /var/cache/conftool/dbconfig/20240620-174557-marostegui.json
[17:48:55] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
[17:51:44] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
[17:55:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:56:25] (PS7) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:57:01] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:57:59] ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9911180 (wiki_willy) During my call with the Dell Account team today, I asked them to push on this a bit more. Th...
[18:00:05] jnuche and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T1800).
[18:01:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P65261 and previous config saved to /var/cache/conftool/dbconfig/20240620-180104-marostegui.json
[18:03:37] (PS8) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[18:04:03] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[18:06:52] !log bking@an-airflow1007 install `ripgrep` deb pkg
[18:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:44] (CR) Scott French: [C:+2] service: move data-gateway service to production [puppet] - https://gerrit.wikimedia.org/r/1032593 (https://phabricator.wikimedia.org/T364921) (owner: Scott French)
[18:13:08] (PS1) Dzahn: aphlict: remove duplicate sytemd timer for logrotate [puppet] - https://gerrit.wikimedia.org/r/1048052 (https://phabricator.wikimedia.org/T367960)
[18:14:15] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:15:10] PROBLEM - MariaDB Replica Lag: s1 #page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:16:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T364069)', diff saved to https://phabricator.wikimedia.org/P65262 and previous config saved to /var/cache/conftool/dbconfig/20240620-181613-marostegui.json
[18:16:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[18:16:19] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[18:16:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[18:16:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T364069)', diff saved to https://phabricator.wikimedia.org/P65263 and previous config saved to /var/cache/conftool/dbconfig/20240620-181635-marostegui.json
[18:19:10] RECOVERY - MariaDB Replica Lag: s1 #page on db1206 is OK: OK slave_sql_lag Replication lag: 31.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:19:42] (CR) DCausse: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[18:20:41] SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9911239 (Ladsgroup) Somehow [[https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaMaintenance/+/refs/heads/master/addWiki.php#249|s...
[18:21:04] sigh
[18:21:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1053.eqiad.wmnet with OS bookworm
[18:21:09] I know what's going on
[18:22:15] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 32.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:17] SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9911243 (Ladsgroup) Before run: ` root@ms-fe1009:~# swift list | grep u4c root@ms-fe1009:~# ` After run: ` root@ms-fe1009:~# swift list |...
[18:24:16] SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9911245 (Ladsgroup) Done for itwiki arbcom too: ` root@ms-fe1009:~# for i in $(swift list | grep wikipedia-arbcom-it) ; do echo "$i:"...
[18:25:04] (CR) Scott French: [C:+2] envoy: add data-gateway service listener [puppet] - https://gerrit.wikimedia.org/r/1032599 (https://phabricator.wikimedia.org/T364921) (owner: Scott French)
[18:25:51] SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9911246 (Ladsgroup) Open→Resolved a:Ladsgroup
[18:34:18] jouncebot: nowandnext
[18:34:19] For the next 1 hour(s) and 25 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T1800)
[18:34:19] In 1 hour(s) and 25 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T2000)
[18:34:42] SRE, Cassandra, Data Products, serviceops, Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9911291 (Scott_French)
[18:35:07] (PS2) Dzahn: aphlict: remove duplicate sytemd timer for logrotate [puppet] - https://gerrit.wikimedia.org/r/1048052 (https://phabricator.wikimedia.org/T367960)
[18:39:15] (PS1) Zabe: Enable local uploads on newly created wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1048055 (https://phabricator.wikimedia.org/T366649)
[18:43:15] (CR) Dzahn: [C:+1] "I see! Alright, +1 then" [puppet] - https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney)
[18:45:54] (CR) Dzahn: [C:+1] "This has happened meanwhile." [puppet] - https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706) (owner: Dzahn)
[18:46:28] SRE, Cassandra, Data Products, serviceops, Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9911340 (Scott_French)
[18:47:23] SRE, Cassandra, Data Products, serviceops, Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9911350 (Scott_French) In progress→Resolved I believe that should be everything now. I'll follow-up in T368096 for ite...
[18:51:15] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:51:32] (PS1) Jforrester: [wikifunctions] Grant wikifunctions-staff enum and converter rights [mediawiki-config] - https://gerrit.wikimedia.org/r/1048059 (https://phabricator.wikimedia.org/T366610)
[18:52:21] (PS1) Ryan Kemper: sre.wdqs.data-transfer: new graph split instances [cookbooks] - https://gerrit.wikimedia.org/r/1048060 (https://phabricator.wikimedia.org/T364077)
[18:53:36] (CR) Dzahn: [C:+2] logstash_checker.py: Add --time option [puppet] - https://gerrit.wikimedia.org/r/1041746 (owner: Ahmon Dancy)
[18:58:03] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[19:00:48] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:01:19] SRE, DBA: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098 (Dzahn) NEW
[19:01:41] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[19:03:09] SRE, DBA: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9911439 (Dzahn)
[19:03:47] RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:04:38] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host elastic2088.codfw.wmnet
[19:05:06] ops-eqiad, SRE, DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9911440 (VRiley-WMF) Swapped out B1 with another compatible DIMM and the unit should be coming back online.
[19:05:09] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T368099 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:05:14] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368099 (ops-monitoring-bot) NEW
[19:07:01] ops-eqiad, SRE, DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9911467 (VRiley-WMF) Open→Resolved
[19:07:01] (CR) Dzahn: [C:+2] "tests passed:" [puppet] - https://gerrit.wikimedia.org/r/1041746 (owner: Ahmon Dancy)
[19:10:15] ops-eqiad, SRE, Cassandra, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9911479 (VRiley-WMF) I swapped the mainboard with a compatible server. Upon booting, it didn't seem to see any memory again. Troubleshot this with @Papaul to no avail. Was instructed to put the...
[19:10:48] ops-eqiad, SRE, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T367004#9911476 (Jclark-ctr) Open→Resolved a:Jclark-ctr No faults since jun 9th
[19:12:41] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2099:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:16:11] (PS9) Ryan Kemper: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[19:16:19] SRE-Sprint-Week-Sustainability-March2023, DBA, Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#9911494 (Ladsgroup) @Marostegui To get the list of direct replicas, something like this would...
[19:17:38] SRE, DBA: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9911495 (Ladsgroup)
[19:18:44] SRE, DBA: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9911497 (Ladsgroup) It has limit of 50,000 and it hits them with 10 of them at the same time.
[19:18:45] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1105 for T348977 - bking@cumin2002 [19:18:45] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic1105 for T348977 - bking@cumin2002 [19:18:50] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 [19:18:50] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1105* for T348977 - bking@cumin2002 [19:18:53] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1105* for T348977 - bking@cumin2002 [19:21:17] (03PS1) 10Ssingh: hiera dnsbox and P:bird: remove references to ntp.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1048064 (https://phabricator.wikimedia.org/T366360) [19:22:48] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3006/co" [puppet] - 10https://gerrit.wikimedia.org/r/1048064 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [19:23:00] (03PS1) 10Ssingh: policies/cr-labs: remove obsolete ntp.anycast.wmnet [homer/public] - 10https://gerrit.wikimedia.org/r/1048066 (https://phabricator.wikimedia.org/T366360) [19:23:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:25:26] (03PS1) 10Ssingh: conftool-data: remove ntp service [puppet] - 10https://gerrit.wikimedia.org/r/1048067 (https://phabricator.wikimedia.org/T366360) [19:26:39] (03CR) 10Ssingh: [V:03+1] "To be merged week of Jun 24." 
[puppet] - 10https://gerrit.wikimedia.org/r/1048064 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [19:32:26] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2099:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:40:34] (03PS3) 10Jdlrobson: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128) [19:42:26] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2099:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:07] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:45:07] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:47:26] RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2099:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:49:45] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048071 (https://phabricator.wikimedia.org/T349774) [19:50:37] (03PS2) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048071 
(https://phabricator.wikimedia.org/T349774) [19:50:59] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [19:51:41] (03PS2) 10Scott French: kubernetes: split unavailable-replicas alert per team [alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) [19:52:04] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048071 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [19:52:59] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [19:53:08] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048071 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [19:54:36] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [19:55:59] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:56:00] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:57:31] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:57:32] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:57:48] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [19:57:50] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:57:51] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:57:54] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:57:55] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:58:24] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:58:51] (03CR) 10Muehlenhoff: [C:03+1] "Indeed, good to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706) (owner: 10Dzahn) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:04:14] (03CR) 10Scott French: "Thank you both!" [alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [20:10:16] (03PS1) 10Bking: team-data-platform: Add all team-search-platform alerts to team-data-platform [alerts] - 10https://gerrit.wikimedia.org/r/1048074 (https://phabricator.wikimedia.org/T368107) [20:20:18] (03CR) 10Scott French: [C:03+2] aqs-http-gateway: remove initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [20:22:02] (03Merged) 10jenkins-bot: aqs-http-gateway: remove initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [20:24:50] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [20:25:03] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [20:25:23] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [20:25:35] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [20:26:04] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [20:26:15] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [20:26:50] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [20:27:01] !log swfrench@deploy1002 
helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [20:27:23] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [20:27:33] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [20:27:55] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply [20:28:06] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [20:30:50] (PS1) Ebernhardson: cirrus: Bump container version [deployment-charts] - https://gerrit.wikimedia.org/r/1048079 [20:33:58] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [20:34:24] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [20:36:28] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [20:36:46] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [20:38:58] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [20:39:16] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [20:40:42] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [20:40:59] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [20:42:42] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [20:43:01] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [20:44:35] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [20:44:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1105-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:44:53] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [20:53:02] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on elastic1105.eqiad.wmnet with reason: T348977 [20:53:07] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 [20:53:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on elastic1105.eqiad.wmnet with reason: T348977 [20:53:51] RECOVERY - Host elastic2088 is UP: PING OK - Packet loss = 0%, RTA = 31.03 ms [20:59:35] (PS1) Peter Fischer: Search update pipeline: use rate-limited HTTP client [deployment-charts] - https://gerrit.wikimedia.org/r/1048083 (https://phabricator.wikimedia.org/T362310) [21:00:19] (CR) Peter Fischer: [C:+2] Search update pipeline: use rate-limited HTTP client [deployment-charts] - https://gerrit.wikimedia.org/r/1048083 (https://phabricator.wikimedia.org/T362310) (owner: Peter Fischer) [21:02:02] (Merged) jenkins-bot: Search update pipeline: use rate-limited HTTP client [deployment-charts] - https://gerrit.wikimedia.org/r/1048083 (https://phabricator.wikimedia.org/T362310) (owner: Peter Fischer) [21:03:12] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [21:03:26] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [21:03:45] (CR) Ebernhardson: [C:+2] cirrus: Bump container version [deployment-charts] - https://gerrit.wikimedia.org/r/1048079 (owner: Ebernhardson) [21:04:16] (Merged) jenkins-bot: cirrus: Bump container version [deployment-charts] - https://gerrit.wikimedia.org/r/1048079 (owner: Ebernhardson) [21:04:44] FIRING: 
CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:04:50] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [21:05:05] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [21:05:29] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368065#9911919 (VRiley-WMF) Swapped out power supply. It looks like it's now reporting properly [21:06:14] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [21:06:30] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [21:07:46] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:07:47] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [21:08:03] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [21:08:32] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:08:52] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:09:15] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:09:42] !log Include ncmonitor 1.0.0 in wikimedia-bookworm apt repo [21:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:53] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [21:10:08] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply 
[21:11:47] !log ebernhardson@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:12:04] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [21:12:21] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [21:12:21] !log ebernhardson@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:14:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:34:57] (CR) JHathaway: [C:+1] Move puppetdb.tuning.conf template [puppet] - https://gerrit.wikimedia.org/r/1047498 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [21:35:53] (CR) JHathaway: [C:+1] Move update-netboot-image.sh to the puppetserver module [puppet] - https://gerrit.wikimedia.org/r/1047495 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [21:37:03] (CR) JHathaway: [C:+1] Remove obsolete motd [puppet] - https://gerrit.wikimedia.org/r/1047949 (owner: Muehlenhoff) [21:43:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367856)', diff saved to https://phabricator.wikimedia.org/P65265 and previous config saved to /var/cache/conftool/dbconfig/20240620-214326-marostegui.json [21:43:32] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [21:45:56] (PS3) Jdlrobson: Enable dark mode on data table pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373) [21:50:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:39] (PS1) David Martin: Add wikilambda_zobject_join to puppet script for sqooping Wikifunctions tables [puppet] - https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435) [21:54:39] (CR) David Martin: "The table creation patch merged today (June 20); will deploy next week." [puppet] - https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435) (owner: David Martin) [21:58:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P65266 and previous config saved to /var/cache/conftool/dbconfig/20240620-215833-marostegui.json [22:03:11] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [22:03:33] (CR) Bking: dse-k8s-services: Add net-new chart for Airflow (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking) [22:04:49] (CR) Bking: dse-k8s-services: Add net-new chart for Airflow (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking) [22:10:11] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [22:12:00] (CR) EoghanGaffney: [C:+2] lists: Add symlink to /var/lib/mailman3 when using different root [puppet] - 
https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney) [22:13:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P65267 and previous config saved to /var/cache/conftool/dbconfig/20240620-221340-marostegui.json [22:14:13] (CR) Dzahn: [C:+1] "https://puppet-compiler.wmflabs.org/output/1047470/3007/gitlab2002.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1047470 (https://phabricator.wikimedia.org/T366786) (owner: Jelto) [22:15:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:18:17] (CR) Dzahn: Extend access for AndyRussG (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: Kamila Součková) [22:19:58] (CR) EoghanGaffney: [C:+1] aphlict: remove duplicate sytemd timer for logrotate [puppet] - https://gerrit.wikimedia.org/r/1048052 (https://phabricator.wikimedia.org/T367960) (owner: Dzahn) [22:20:07] (CR) Dzahn: Extend access for AndyRussG (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: Kamila Součková) [22:21:46] jouncebot: nowandnext [22:21:47] No deployments scheduled for the next 7 hour(s) and 38 minute(s) [22:21:47] In 7 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240621T0600) [22:22:26] (CR) Zabe: [C:+2] Enable local uploads on newly created wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1048055 (https://phabricator.wikimedia.org/T366649) (owner: Zabe) [22:23:05] (Merged) jenkins-bot: Enable local uploads on newly 
created wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1048055 (https://phabricator.wikimedia.org/T366649) (owner: Zabe) [22:24:19] (CR) Andrea Denisse: [C:+1] "lgtm!" [alerts] - https://gerrit.wikimedia.org/r/1048074 (https://phabricator.wikimedia.org/T368107) (owner: Bking) [22:26:38] (PS3) Dzahn: admin: Extend access for AndyRussG [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: Kamila Součková) [22:27:31] (CR) Dzahn: [C:+2] aphlict: remove duplicate sytemd timer for logrotate [puppet] - https://gerrit.wikimedia.org/r/1048052 (https://phabricator.wikimedia.org/T367960) (owner: Dzahn) [22:28:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367856)', diff saved to https://phabricator.wikimedia.org/P65268 and previous config saved to /var/cache/conftool/dbconfig/20240620-222847-marostegui.json [22:28:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [22:28:52] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [22:29:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [22:29:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T367856)', diff saved to https://phabricator.wikimedia.org/P65269 and previous config saved to /var/cache/conftool/dbconfig/20240620-222909-marostegui.json [22:30:29] (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3008/console" [puppet] - https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706) (owner: Dzahn) [22:33:44] (CR) EoghanGaffney: [V:+1 C:+2] mailman3: remove buster support [puppet] - 
https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706) (owner: Dzahn) [22:33:50] !log zabe@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T361041 T363825 T366649 (duration: 09m 55s) [22:33:59] T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041 [22:33:59] T363825: Create private wikipedia_it_arbcom wiki - https://phabricator.wikimedia.org/T363825 [22:33:59] T366649: Create an 'Universal Code of Conduct Coordinating Committee (U4C)' private wiki - https://phabricator.wikimedia.org/T366649 [22:39:57] !log aphlict1002/aphlict2001 - systemctl stop aphlict_lograte.timer (and .service); systemctl disable aphlict_logrotate.timer (and .service); systemctl daemon-reload; systemctl reset-failed T367960 [22:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:02] T367960: SystemdUnitFailed - aphlict1002 - logrotate - https://phabricator.wikimedia.org/T367960 [22:42:19] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:44:45] (CR) Dzahn: "gerrit thinks it's still my turn (attention set) even though this is already merged. 
so +1 :)" [puppet] - https://gerrit.wikimedia.org/r/1047116 (https://phabricator.wikimedia.org/T355097) (owner: Brennen Bearnes) [22:45:13] (CR) Dzahn: [C:+1] "related: updated secret in private repo as requested" [puppet] - https://gerrit.wikimedia.org/r/1047116 (https://phabricator.wikimedia.org/T355097) (owner: Brennen Bearnes) [22:48:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T364069)', diff saved to https://phabricator.wikimedia.org/P65270 and previous config saved to /var/cache/conftool/dbconfig/20240620-224803-marostegui.json [22:48:09] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [23:03:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P65271 and previous config saved to /var/cache/conftool/dbconfig/20240620-230310-marostegui.json [23:13:35] (PS1) Zabe: Initial configuration for btmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1048116 (https://phabricator.wikimedia.org/T368038) [23:14:39] (CR) Bartosz Dziewoński: [C:+1] "Looks good enough to try on the beta cluster. I have a few questions about some details." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza) [23:14:46] (CR) Bartosz Dziewoński: [C:+1] [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza) [23:16:00] (CR) Zabe: [C:+2] Initial configuration for btmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1048116 (https://phabricator.wikimedia.org/T368038) (owner: Zabe) [23:16:41] (Merged) jenkins-bot: Initial configuration for btmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1048116 (https://phabricator.wikimedia.org/T368038) (owner: Zabe) [23:18:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P65272 and previous config saved to /var/cache/conftool/dbconfig/20240620-231817-marostegui.json [23:20:38] !log create Wikipedia Mandailing # T368038 [23:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:43] T368038: Create Wikipedia Mandailing - https://phabricator.wikimedia.org/T368038 [23:20:58] !log zabe@deploy1002 Started scap: Creating btmwiki (T368038) [23:23:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:33:19] !log zabe@deploy1002 Finished scap: Creating btmwiki (T368038) (duration: 12m 20s) [23:33:24] T368038: Create Wikipedia Mandailing - https://phabricator.wikimedia.org/T368038 [23:33:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T364069)', diff saved to https://phabricator.wikimedia.org/P65273 and previous config saved to /var/cache/conftool/dbconfig/20240620-233324-marostegui.json 
[23:33:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [23:33:30] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [23:33:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [23:33:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T364069)', diff saved to https://phabricator.wikimedia.org/P65274 and previous config saved to /var/cache/conftool/dbconfig/20240620-233346-marostegui.json [23:34:59] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=btmwiki --cluster=all 2>&1 | tee /tmp/btmwiki.UpdateSearchIndexConfig.log # T368038 [23:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:41] (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1048121 [23:35:42] (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1048121 (owner: Zabe) [23:36:29] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1048121 (owner: Zabe) [23:36:57] !log zabe@deploy1002 Started scap: Update interwiki cache [23:38:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048122 [23:38:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048122 (owner: TrainBranchBot) [23:45:19] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:47:09] !log zabe@deploy1002 Finished scap: Update interwiki cache (duration: 10m 
12s) [23:47:29] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:48:21] PROBLEM - Webrequests Varnishkafka log producer on cp3067 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:48:21] PROBLEM - eventlogging Varnishkafka log producer on cp3072 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:48:22] PROBLEM - eventlogging Varnishkafka log producer on cp3068 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:48:24] PROBLEM - eventlogging Varnishkafka log producer on cp3071 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:48:25] PROBLEM - eventlogging Varnishkafka log producer on cp3067 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:48:27] PROBLEM - statsv Varnishkafka log producer on cp3070 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:48:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:50:27] PROBLEM - Webrequests Varnishkafka log producer on cp3072 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:50:48] !incidents [23:50:49] 4765 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [23:50:49] 4764 (RESOLVED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [23:50:52] !ack 4765 [23:50:52] 4765 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [23:53:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:53:52] many user reports of 503s [23:53:56] but fine here [23:54:12] ok many is 2 they just said it more than once [23:55:19] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:55:38] AntiComposite: should be recovering