[00:00:25] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:00:56] FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:05:56] RESOLVED: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:05:59] (03PS3) 10Andrea Denisse: grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) [00:05:59] (03PS5) 10Andrea Denisse: grafana: Toggle data sync using feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) [00:06:59] (03CR) 10Andrea Denisse: grafana: Toggle data sync using feature flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [00:07:20] (03CR) 10Andrea Denisse: grafana: Add enable_dashboard_sync feature flag in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [00:09:11] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1142002 [00:09:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 
10https://gerrit.wikimedia.org/r/1142002 (owner: 10TrainBranchBot) [00:23:45] ryankemper@cumin2002 reimage (PID 3346232) is awaiting input [00:24:40] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye [00:30:25] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:34:58] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1142002 (owner: 10TrainBranchBot) [00:43:16] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10794413 (10wiki_willy) Hi @Papaul - do you have any other recommendations for this one? >>! In T393296#10789353, @Marostegui wrote: > We cannot keep working and impacting production users like this a... 
[00:43:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:46:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:50:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2151:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2151 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:00:23] (03CR) 10Cwhite: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [01:00:32] (03CR) 10Cwhite: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [01:08:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.28 [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142027 (https://phabricator.wikimedia.org/T386223) [01:08:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.28 [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142027 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot) [01:19:35] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10794425 (10Papaul) @VRiley-WMF is it possible to change the server environment, move the server from rack D3 to another rack in row D? This will not affect any server configuration, server will still... 
[01:23:56] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.28 [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142027 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot) [01:44:22] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:44:22] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:44:22] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:44:30] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1016 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:44:30] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1011 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:44:30] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2015 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:44:40] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:44:40] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 
(blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:44:56] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:45:12] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1011 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:45:25] FIRING: [22x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:48:48] FIRING: [5x] PuppetFailure: Puppet has failed on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:49:17] FIRING: [10x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:49:28] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [01:57:21] 06SRE-OnFire, 10MW-on-K8s, 06serviceops, 13Patch-For-Review, 10Sustainability (Incident Followup): mwscript-k8s creates too many resources - https://phabricator.wikimedia.org/T376795#10794469 (10RLazarus) 05Open→03Resolved a:03Joe Resolved with https://gerrit.wikimedia.org/r/1117548. 
[01:59:28] RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [01:59:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T0200) [02:00:25] FIRING: [22x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:04:43] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:04:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:05:25] FIRING: [22x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:58] RESOLVED: [2x] 
SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:14:43] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:22:01] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wdqs-all from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [02:22:03] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T388134, bring new main graph hosts into service) xfer wdqs-all from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [02:22:04] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [02:24:00] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [02:24:13] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 12s) [02:24:18] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wdqs-all from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [02:24:20] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T388134, bring new main graph hosts into service) xfer wdqs-all from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [02:24:22] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1016 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook 
[02:24:30] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1016 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:24:43] RESOLVED: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:24:48] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1016 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:25:06] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1016 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:27:15] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wdqs-all from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [02:27:18] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [02:32:09] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T388134, bring new main graph hosts into service) xfer wdqs-all from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [02:37:47] (03PS1) 10Ryan Kemper: wdqs: fix data loaded flag for wdqs-all option [cookbooks] - 10https://gerrit.wikimedia.org/r/1142070 [02:39:42] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:39:57] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2014:9100 - TODO - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:41:22] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [02:41:25] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [02:43:45] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T388134, bring new main graph hosts into service) xfer wikidata from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [02:44:35] (03PS2) 10Ryan Kemper: wdqs: fix data loaded flag for wdqs-all option [cookbooks] - 10https://gerrit.wikimedia.org/r/1142070 [02:44:42] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:49:42] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:49:47] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [02:49:50] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [02:52:14] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T388134, bring new main graph hosts into service) xfer wikidata from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [02:54:42] RESOLVED: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service 
crashloop on wdqs2014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T0300) [03:00:26] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [03:00:29] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [03:01:26] PROBLEM - Restbase root url on restbase1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/RESTBase [03:01:36] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142080 (https://phabricator.wikimedia.org/T386223) [03:01:38] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142080 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot) [03:02:24] RECOVERY - Restbase root url on restbase1042 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 7.713 second response time https://wikitech.wikimedia.org/wiki/RESTBase [03:03:05] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142080 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot) [03:03:29] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.28 refs T386223 [03:03:32] T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223 [03:04:42] FIRING: [4x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:05:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [03:08:24] (03PS1) 10Ryan Kemper: wdqs: add scap dsh targets for new wdqs-main hosts [puppet] - 10https://gerrit.wikimedia.org/r/1142085 (https://phabricator.wikimedia.org/T388134) [03:09:08] (03CR) 10Ryan Kemper: [C:03+1] wdqs: add scap dsh targets for new wdqs-main hosts [puppet] - 10https://gerrit.wikimedia.org/r/1142085 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [03:09:10] (03CR) 10Ryan Kemper: [C:03+2] wdqs: add scap dsh targets for new wdqs-main hosts [puppet] - 10https://gerrit.wikimedia.org/r/1142085 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [03:09:42] RESOLVED: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:09:54] (03PS2) 10Ryan Kemper: wdqs-main: bring old internal hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/1141957 (https://phabricator.wikimedia.org/T388134) [03:14:42] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:16:22] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [03:16:37] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 14s) [03:16:40] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2008 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* 
--port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:16:49] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [03:16:56] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2008 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:17:03] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 13s) [03:17:06] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2014 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:17:06] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:17:20] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [03:17:22] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:17:22] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:17:34] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 13s) [03:17:40] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:17:50] RECOVERY - Blazegraph Port for 
wdqs-blazegraph on wdqs2014 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:17:50] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2015 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:17:52] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [03:18:05] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 12s) [03:18:06] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1011 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:18:06] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2015 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:18:12] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1011 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:18:22] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2015 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:18:30] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1011 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:18:30] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2015 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:18:35] !log 
ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1011.eqiad.wmnet, repooling source-only afterwards [03:18:38] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [03:18:48] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1011 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:20:25] FIRING: [14x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:24:42] RESOLVED: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:25:44] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1011.eqiad.wmnet, repooling source-only afterwards [03:25:47] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [03:27:22] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2008.codfw.wmnet, repooling source-only afterwards [03:29:51] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:30:25] FIRING: [13x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:33:48] FIRING: [3x] PuppetFailure: Puppet has failed on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:35:25] FIRING: [13x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:35:39] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2008.codfw.wmnet, repooling source-only afterwards [03:35:42] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [03:36:02] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2014.codfw.wmnet, repooling source-only afterwards [03:38:48] FIRING: [3x] PuppetFailure: Puppet has failed on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:39:42] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:40:25] FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:43:48] RESOLVED: [3x] PuppetFailure: Puppet has failed on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:44:39] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2014.codfw.wmnet, repooling source-only afterwards [03:44:42] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [03:45:25] FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:45:25] !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15:00:00 on wdqs[2008,2014-2015].codfw.wmnet,wdqs[1011,1016].eqiad.wmnet with reason: T388134 [03:45:38] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2015.codfw.wmnet, repooling source-only afterwards [03:49:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:52:45] 
!log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2015.codfw.wmnet, repooling source-only afterwards [03:52:48] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T0400) [04:00:25] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:06:13] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.44.0-wmf.28 refs T386223 (duration: 62m 44s) [04:06:17] T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223 [04:07:37] !log mwpresync@deploy1003 Pruned MediaWiki: 1.44.0-wmf.24 (duration: 07m 35s) [04:43:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:46:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:27:36] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:57:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T0600) [06:00:05] marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T0600) [06:25:13] (03PS1) 10Jgiannelos: pcs-rb-sunset: Rollout all wikis except en/zh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142213 [06:25:40] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:26:10] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:26:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:28:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:31:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:01:15] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146#10794653 (10elukey) Today I was reviewing the alerts for perccli-related nagios checks, and I found non-ms-be nodes that will likely keep the current control... [07:04:26] yep [07:07:32] (03CR) 10Ayounsi: [C:03+1] "Thanks! Lgtm, +1 if you've tested it successfully." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [07:09:27] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10794670 (10Nikerabbit) [07:11:05] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10794674 (10Nikerabbit) [07:11:11] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10794676 (10Nikerabbit) [07:22:49] (03CR) 10Brouberol: "Either that or we amend this change. Both sound fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [07:25:08] (03CR) 10Brouberol: [C:03+1] "TIL we have local SMTP set up on an-launcher:" [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) (owner: 10Xcollazo) [07:25:10] (03CR) 10Brouberol: [C:03+2] Default smtp to localhost for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) (owner: 10Xcollazo) [07:27:38] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:28:26] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142457 [07:29:51] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[07:32:49] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10794703 (10Aklapper) @JVanderhoop-WMF: Which documentation located where was followed to created this task? As far as I know we do not ask anywhere... [07:32:52] (03PS1) 10Brouberol: airflow: include config/secret annotation on the gitsync deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142483 (https://phabricator.wikimedia.org/T390932) [07:33:25] puppetmaster1001 alerts are expected? [07:37:14] (03PS2) 10Jgiannelos: pcs-rb-sunset: Rollout all wikis except en/zh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142213 (https://phabricator.wikimedia.org/T390724) [07:37:15] (03PS1) 10Stevemunene: hdfs: Exclude rack F3 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1142485 (https://phabricator.wikimedia.org/T390171) [07:38:51] (03PS2) 10Stevemunene: zookeeper: onboard an-conf1004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135025 (https://phabricator.wikimedia.org/T374922) [07:38:51] (03PS2) 10Stevemunene: zookeeper: onboard an-conf1005 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135026 (https://phabricator.wikimedia.org/T374922) [07:38:51] (03PS2) 10Stevemunene: zookeeper: onboard an-conf1006 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922) [07:38:52] (03PS2) 10Stevemunene: zookeeper: remove an-conf100[1-3] from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) [07:40:02] (03CR) 10Majavah: [C:03+2] quarry: Drop obsolete files [puppet] - 10https://gerrit.wikimedia.org/r/1138241 (owner: 10Majavah) [07:41:39] (03CR) 10Majavah: [C:03+2] toolforge: toolviews: Drop support for tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1121630 (owner: 10Majavah) [07:43:40] (03Abandoned) 10Majavah: 
realm: stop setting labsproject [puppet] - 10https://gerrit.wikimedia.org/r/916425 (owner: 10Majavah) [07:45:15] vgutierrez: not that I know, lemme check [07:45:25] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:47:00] ah wow passenger is not up [07:47:37] no sorry it is up [07:49:06] !log restart apache2 on puppetmaster1001 [07:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:51:21] vgutierrez: ah ok "Get \"https://10.64.16.73:8141/puppet/v3\": x509: certificate relies on legacy Common Name field, use SANs instead" [07:51:48] IIRC it surfaced in the past, I think there was a downtime or similar [07:52:41] (03CR) 10Filippo Giunchedi: [C:03+2] kubernetes: remove usage of prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [07:56:21] (03PS1) 10Filippo Giunchedi: Remove prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1142497 (https://phabricator.wikimedia.org/T389170) [07:58:35] (03CR) 10Filippo Giunchedi: [C:03+1] grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [07:58:38] (03CR) 10Filippo Giunchedi: [C:03+1] grafana: Toggle data sync using feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [08:00:25] FIRING: 
SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:30] (03PS1) 10Jelto: gitlab: disable object storage on replica gitlab1003 [puppet] - 10https://gerrit.wikimedia.org/r/1142498 (https://phabricator.wikimedia.org/T378922) [08:02:36] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1142498 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:09:42] (03CR) 10Tiziano Fogli: [C:03+1] Remove prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1142497 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [08:10:07] (03PS10) 10Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387866) [08:10:22] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387866) (owner: 10Tiziano Fogli) [08:10:27] (03CR) 10Filippo Giunchedi: [C:03+2] Remove prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1142497 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [08:13:10] (03CR) 10Volans: [C:03+1] "LGTM if reverse pointers don't need to be wiped." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [08:23:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:24:48] (03CR) 10Tiziano Fogli: [C:03+2] pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387866) (owner: 10Tiziano Fogli) [08:25:20] (03CR) 10Cathal Mooney: gNMIc: collect optics status on Juniper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140688 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:30:26] (03CR) 10Filippo Giunchedi: "httpd config LGTM, though local auth is also used by opensearch-dashboards for beta-logs.wmcloud.org which will need adjustment too" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [08:30:46] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10794812 (10BTullis) a:03BTullis I can pick up this ticket and work with Julie to get the access that she needs. In answer to your question @Akla... [08:38:00] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10794821 (10cmooney) 05Open→03Resolved Happy to say this is still looking clean. {F59718360 width=600} So whatever happened previously one... 
[08:38:00] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: disable object storage on replica gitlab1003 [puppet] - 10https://gerrit.wikimedia.org/r/1142498 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:39:00] (03CR) 10Arnaudb: [C:03+1] gitlab: disable object storage on replica gitlab1003 [puppet] - 10https://gerrit.wikimedia.org/r/1142498 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:40:30] (03PS1) 10Majavah: Revert "common: Temporarily remove some keys" [homer/public] - 10https://gerrit.wikimedia.org/r/1142517 [08:41:27] (03PS1) 10Elukey: raid: update facter and get-raid-status to allow storcli [puppet] - 10https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) [08:44:37] (03PS2) 10Majavah: P:microsites: peopleweb: Refresh image on front page [puppet] - 10https://gerrit.wikimedia.org/r/1141280 [08:44:37] (03PS2) 10Majavah: P:microsites: peopleweb: Set content as utf-8 by default [puppet] - 10https://gerrit.wikimedia.org/r/1141311 [08:47:03] (03CR) 10Elukey: "Tried to run https://puppet-compiler.wmflabs.org/output/1142518/5455/ms-be1090.eqiad.wmnet/change.ms-be1090.eqiad.wmnet.err but I think th" [puppet] - 10https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [08:48:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:49:20] (03CR) 10Elukey: [C:03+1] setup.py: update redis dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141949 (owner: 10Volans) [08:49:40] (03CR) 10Elukey: [C:03+1] setup.py: update kubernetes dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141941 (owner: 10Volans) [08:50:13] (03CR) 10Volans: [C:03+2] setup.py: update kubernetes dependency [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1141941 (owner: 10Volans) [08:50:20] (03CR) 10Volans: [C:03+2] setup.py: update redis dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141949 (owner: 10Volans) [08:52:46] (03PS1) 10David Caro: toolsbeta: update prometheus cert [puppet] - 10https://gerrit.wikimedia.org/r/1142520 (https://phabricator.wikimedia.org/T393438) [08:53:21] (03CR) 10Elukey: [C:03+1] tests: refactor global tests for all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1137284 (owner: 10Volans) [08:54:25] (03CR) 10Majavah: [C:03+1] toolsbeta: update prometheus cert [puppet] - 10https://gerrit.wikimedia.org/r/1142520 (https://phabricator.wikimedia.org/T393438) (owner: 10David Caro) [08:54:57] (03CR) 10Hnowlan: [C:03+1] pcs-rb-sunset: Rollout all wikis except en/zh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142213 (https://phabricator.wikimedia.org/T390724) (owner: 10Jgiannelos) [08:55:08] (03CR) 10David Caro: [C:03+2] toolsbeta: update prometheus cert [puppet] - 10https://gerrit.wikimedia.org/r/1142520 (https://phabricator.wikimedia.org/T393438) (owner: 10David Caro) [08:59:49] (03Merged) 10jenkins-bot: setup.py: update kubernetes dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141941 (owner: 10Volans) [09:00:03] (03Merged) 10jenkins-bot: setup.py: update redis dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141949 (owner: 10Volans) [09:02:06] (03CR) 10Ayounsi: gNMIc: collect optics status on Juniper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140688 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:04:58] (03CR) 10Stevemunene: [C:03+1] airflow: include config/secret annotation on the gitsync deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142483 (https://phabricator.wikimedia.org/T390932) (owner: 10Brouberol) [09:05:49] (03PS2) 10Volans: tests: refactor global tests for all cookbooks [cookbooks] - 
10https://gerrit.wikimedia.org/r/1137284 [09:07:19] (03CR) 10Cathal Mooney: [C:03+1] gNMIc: collect optics status on Juniper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140688 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:08:39] (03CR) 10Jgiannelos: [C:03+2] pcs-rb-sunset: Rollout all wikis except en/zh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142213 (https://phabricator.wikimedia.org/T390724) (owner: 10Jgiannelos) [09:10:14] (03Merged) 10jenkins-bot: pcs-rb-sunset: Rollout all wikis except en/zh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142213 (https://phabricator.wikimedia.org/T390724) (owner: 10Jgiannelos) [09:11:39] (03CR) 10Btullis: [C:03+1] hdfs: Exclude rack F3 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1142485 (https://phabricator.wikimedia.org/T390171) (owner: 10Stevemunene) [09:13:00] (03CR) 10Volans: [C:03+2] tests: refactor global tests for all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1137284 (owner: 10Volans) [09:15:10] (03PS1) 10Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142531 (https://phabricator.wikimedia.org/T391532) [09:16:17] (03CR) 10Hasan Akgün (WMDE): [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142531 (https://phabricator.wikimedia.org/T391532) (owner: 10Lucas Werkmeister (WMDE)) [09:17:29] jouncebot: nowandnext [09:17:29] No deployments scheduled for the next 0 hour(s) and 42 minute(s) [09:17:29] In 0 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1000) [09:17:42] I’ll probably deploy that ^ deployment-charts change in a moment if nobody objects [09:19:21] (03PS1) 10Majavah: P:wmcs::instance: Drop unneeded syslog overrides [puppet] - 10https://gerrit.wikimedia.org/r/1142534 [09:20:18] 
(03PS1) 10Majavah: P:wmcs::instance: Don't install puppet-lint [puppet] - 10https://gerrit.wikimedia.org/r/1142535 [09:20:53] (03Merged) 10jenkins-bot: tests: refactor global tests for all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1137284 (owner: 10Volans) [09:22:57] hnowlan: can i go ahead and deploy changeprop? [09:23:58] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "Let’s try this one and see if it works better." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142531 (https://phabricator.wikimedia.org/T391532) (owner: 10Lucas Werkmeister (WMDE)) [09:25:31] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142531 (https://phabricator.wikimedia.org/T391532) (owner: 10Lucas Werkmeister (WMDE)) [09:25:46] deploying that… [09:26:15] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [09:26:34] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [09:27:26] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [09:27:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:27:45] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [09:28:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:28:01] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [09:28:15] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:28:20] (03PS1) 10Ayounsi: esams: remove Tele2 transit [homer/public] - 10https://gerrit.wikimedia.org/r/1142539 (https://phabricator.wikimedia.org/T393401) [09:28:42] !log 
lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [09:28:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:30:13] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142457 (owner: 10PipelineBot) [09:30:59] * Lucas_WMDE done deploying [09:33:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:04] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1169.eqiad.wmnet with reason: Maintenance [09:34:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T382778)', diff saved to https://phabricator.wikimedia.org/P75754 and previous config saved to /var/cache/conftool/dbconfig/20250506-093410-ladsgroup.json [09:34:13] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [09:35:57] (03PS1) 10Filippo Giunchedi: tox: update for Trixie [alerts] - 10https://gerrit.wikimedia.org/r/1142542 [09:35:57] (03PS1) 10Filippo Giunchedi: sre: alert on Prometheus codfw/eqiad down [alerts] - 10https://gerrit.wikimedia.org/r/1142543 (https://phabricator.wikimedia.org/T393365) [09:37:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T382778)', diff saved to https://phabricator.wikimedia.org/P75755 and previous config saved to /var/cache/conftool/dbconfig/20250506-093704-ladsgroup.json [09:37:40] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] tox: update for Trixie [alerts] - 10https://gerrit.wikimedia.org/r/1142542 (owner: 10Filippo Giunchedi) [09:40:06] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database nupwiki (T390714) 
[09:40:09] T390714: [wikireplicas] Create views for new wiki nupwiki - https://phabricator.wikimedia.org/T390714 [09:40:10] nemo-yiannis: go ahead [09:40:17] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database nupwiki (T390714) [09:40:22] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1141311 (owner: 10Majavah) [09:40:24] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [09:40:44] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [09:40:49] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [09:41:56] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [09:42:02] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [09:42:51] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [09:43:36] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [09:44:12] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [09:44:15] (03PS1) 10Majavah: openstack: Use IPv6 dualstack network for image creation [puppet] - 10https://gerrit.wikimedia.org/r/1142546 [09:46:50] (03CR) 10Jelto: "looks mostly good, one typo in `alt`" [puppet] - 10https://gerrit.wikimedia.org/r/1141280 (owner: 10Majavah) [09:49:30] (03PS3) 10Majavah: P:microsites: peopleweb: Refresh image on front page [puppet] - 10https://gerrit.wikimedia.org/r/1141280 [09:49:30] (03PS3) 10Majavah: P:microsites: peopleweb: Set content as utf-8 by default [puppet] - 10https://gerrit.wikimedia.org/r/1141311 [09:49:42] (03PS4) 10Majavah: P:microsites: peopleweb: Set content as utf-8 by default [puppet] - 10https://gerrit.wikimedia.org/r/1141311 [09:49:42] (03PS4) 10Majavah: P:microsites: peopleweb: Refresh image on front page [puppet] - 
10https://gerrit.wikimedia.org/r/1141280 [09:51:11] (03CR) 10Majavah: P:microsites: peopleweb: Refresh image on front page (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1141280 (owner: 10Majavah) [09:52:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P75756 and previous config saved to /var/cache/conftool/dbconfig/20250506-095212-ladsgroup.json [09:52:39] (03CR) 10CI reject: [V:04-1] P:microsites: peopleweb: Set content as utf-8 by default [puppet] - 10https://gerrit.wikimedia.org/r/1141311 (owner: 10Majavah) [09:53:01] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1141311 (owner: 10Majavah) [09:53:09] (03CR) 10Majavah: [C:03+2] P:microsites: peopleweb: Set content as utf-8 by default [puppet] - 10https://gerrit.wikimedia.org/r/1141311 (owner: 10Majavah) [09:54:38] (03CR) 10Brouberol: [C:03+2] airflow: include config/secret annotation on the gitsync deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142483 (https://phabricator.wikimedia.org/T390932) (owner: 10Brouberol) [09:55:28] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:55:49] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [09:56:01] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1122138 (owner: 10Ayounsi) [09:56:01] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [09:56:09] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [09:56:15] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [09:56:52] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [09:57:14] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [09:57:44] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [09:59:21] (03PS1) 10Filippo Giunchedi: pontoon: remove minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1142553 [09:59:42] (03CR) 10Hnowlan: [C:03+1] mw::maintenance: update team for pagetriage jobs [puppet] - 10https://gerrit.wikimedia.org/r/1141946 (https://phabricator.wikimedia.org/T393395) (owner: 10Scott French) [09:59:50] (03PS2) 10Filippo Giunchedi: pontoon: remove minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1142553 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1000) [10:00:35] (03CR) 10Hnowlan: [C:03+1] alertmanager: add receiver and routing for moderator-tools tasks [puppet] - 10https://gerrit.wikimedia.org/r/1141945 (https://phabricator.wikimedia.org/T393395) (owner: 10Scott French) [10:04:56] (03CR) 10Volans: "FYI this change broke some cookbooks" [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: 10Ryan Kemper) [10:05:18] (03CR) 10Nikerabbit: [C:03+1] Catalog ContentTranslation tables [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas) [10:07:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to 
https://phabricator.wikimedia.org/P75757 and previous config saved to /var/cache/conftool/dbconfig/20250506-100719-ladsgroup.json [10:15:29] (03CR) 10Sergio Gimeno: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [10:18:46] (03PS1) 10Volans: elasticsearch: temporarily remove it from bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1142557 (https://phabricator.wikimedia.org/T390860) [10:19:09] (03PS5) 10Nik Gkountas: Catalog ContentTranslation tables [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) [10:19:10] (03CR) 10Ladsgroup: [C:03+2] Catalog ContentTranslation tables [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas) [10:19:13] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Catalog ContentTranslation tables [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas) [10:22:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:22:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T382778)', diff saved to https://phabricator.wikimedia.org/P75758 and previous config saved to /var/cache/conftool/dbconfig/20250506-102226-ladsgroup.json [10:22:29] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [10:22:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1184.eqiad.wmnet with reason: Maintenance [10:22:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T382778)', diff saved to https://phabricator.wikimedia.org/P75759 
and previous config saved to /var/cache/conftool/dbconfig/20250506-102236-ladsgroup.json [10:26:18] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140992 (https://phabricator.wikimedia.org/T393154) (owner: 10Ilias Sarantopoulos) [10:26:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T382778)', diff saved to https://phabricator.wikimedia.org/P75760 and previous config saved to /var/cache/conftool/dbconfig/20250506-102624-ladsgroup.json [10:29:38] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:29:51] FIRING: [20x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:32:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:32:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [10:39:33] (03PS2) 10Hnowlan: mw::maintenance: migrate one image suggestions job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140671 (https://phabricator.wikimedia.org/T388537) [10:41:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P75761 and previous config saved to 
/var/cache/conftool/dbconfig/20250506-104131-ladsgroup.json [10:43:26] (03PS2) 10Hnowlan: mw::maintenance: migrate all image suggestions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140672 (https://phabricator.wikimedia.org/T388537) [10:43:52] (03PS3) 10Hnowlan: mw::maintenance: migrate one image suggestions job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140671 (https://phabricator.wikimedia.org/T388537) [10:49:30] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1141280 (owner: 10Majavah) [10:50:00] (03CR) 10Majavah: [C:03+2] P:microsites: peopleweb: Refresh image on front page [puppet] - 10https://gerrit.wikimedia.org/r/1141280 (owner: 10Majavah) [10:55:30] (03PS2) 10Federico Ceratto: Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) [10:55:30] (03CR) 10Federico Ceratto: [C:03+1] "As discussed on IRC with Manuel." [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [10:55:58] (03CR) 10Federico Ceratto: "(pressed +1 by mistake, resetting to 0)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [10:56:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P75762 and previous config saved to /var/cache/conftool/dbconfig/20250506-105639-ladsgroup.json [10:58:15] (03PS5) 10Hnowlan: mw::maintenance: migrate purgeExpiredBlocks to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542) [10:58:20] (03CR) 10Hnowlan: mw::maintenance: migrate purgeExpiredBlocks to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [11:05:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Tuesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139487 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [11:05:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139488 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [11:07:07] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: disable gpu in edit-check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140992 (https://phabricator.wikimedia.org/T393154) (owner: 10Ilias Sarantopoulos) [11:08:56] (03Merged) 10jenkins-bot: ml-services: disable gpu in edit-check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140992 (https://phabricator.wikimedia.org/T393154) (owner: 10Ilias Sarantopoulos) [11:11:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T382778)', diff saved to https://phabricator.wikimedia.org/P75763 and previous config saved to /var/cache/conftool/dbconfig/20250506-111146-ladsgroup.json [11:11:49] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [11:11:50] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1186.eqiad.wmnet with reason: Maintenance [11:11:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T382778)', diff saved to https://phabricator.wikimedia.org/P75764 and previous config saved to /var/cache/conftool/dbconfig/20250506-111157-ladsgroup.json [11:15:23] (03PS1) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) [11:15:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1186 (T382778)', diff saved to https://phabricator.wikimedia.org/P75765 and previous config saved to /var/cache/conftool/dbconfig/20250506-111524-ladsgroup.json [11:15:37] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140671 (https://phabricator.wikimedia.org/T388537) (owner: 10Hnowlan) [11:16:00] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [11:17:34] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140672 (https://phabricator.wikimedia.org/T388537) (owner: 10Hnowlan) [11:21:12] (03CR) 10Stevemunene: [C:03+2] hdfs: Exclude rack F3 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1142485 (https://phabricator.wikimedia.org/T390171) (owner: 10Stevemunene) [11:21:14] (03PS1) 10Hnowlan: mw::maintenance: migrate readinglists job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142568 (https://phabricator.wikimedia.org/T388541) [11:21:51] (03CR) 10CI reject: [V:04-1] mw::maintenance: migrate readinglists job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142568 (https://phabricator.wikimedia.org/T388541) (owner: 10Hnowlan) [11:22:40] (03PS2) 10Hnowlan: mw::maintenance: migrate readinglists job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142568 (https://phabricator.wikimedia.org/T388541) [11:23:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10795317 (10Stevemunene) [11:24:41] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10795323 (10BTullis) [11:25:25] (03PS1) 10Majavah: dynamicproxy: Add missing redis_shutdown() call [puppet] - 10https://gerrit.wikimedia.org/r/1142572 
(https://phabricator.wikimedia.org/T393024) [11:26:18] (03CR) 10David Caro: [C:03+1] "LGTM :crossingfingers:" [puppet] - 10https://gerrit.wikimedia.org/r/1142572 (https://phabricator.wikimedia.org/T393024) (owner: 10Majavah) [11:26:48] (03CR) 10Majavah: [C:03+2] dynamicproxy: Add missing redis_shutdown() call [puppet] - 10https://gerrit.wikimedia.org/r/1142572 (https://phabricator.wikimedia.org/T393024) (owner: 10Majavah) [11:26:50] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10795326 (10Jelto) I disabled object storage on `gitlab1003` again. The migration back included setting the `profile::gitlab::objec... [11:28:20] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10795341 (10Jelto) [11:28:32] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#10795345 (10Jelto) [11:29:36] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:29:44] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10795350 (10BTullis) [11:30:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P75766 and previous config saved to /var/cache/conftool/dbconfig/20250506-113031-ladsgroup.json [11:30:47] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10795353 (10BTullis) [11:35:24] (03CR) 10Kamila Součková: [C:03+2] CampaignEvents: Migrate aggregateparticipantanswers-testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1139811 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [11:36:15] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 6:00:00 on backup1013.eqiad.wmnet with reason: Upgrade and restart [11:36:50] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10795378 (10BTullis) Hi @JVanderhoop-WMF - Please could you do the following? * Read and signed: {L3} * Read: https://wikitech.wikimedia.org/wiki/Da... 
[11:37:09] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 6:00:00 on backup[2010-2014].codfw.wmnet with reason: Upgrade and restart [11:38:12] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10795382 (10BTullis) [11:39:06] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10795384 (10BTullis) @Milimetric or @Ahoelzl - Please could either of you approve Julie's membership of the `analytics-privatedata-users` group? Tha... [11:39:17] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10795386 (10BTullis) [11:41:29] (03PS2) 10Awight: Revert "Temporarily revoke ssh key for travel" [puppet] - 10https://gerrit.wikimedia.org/r/1139434 [11:41:37] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "Confirmed OOB" [puppet] - 10https://gerrit.wikimedia.org/r/1139434 (owner: 10Awight) [11:42:36] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:54] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [11:43:11] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [11:45:25] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1186', diff saved to https://phabricator.wikimedia.org/P75767 and previous config saved to /var/cache/conftool/dbconfig/20250506-114538-ladsgroup.json [11:46:06] (03PS1) 10Slyngshede: Login: fix redirect on login [software/bitu] - 10https://gerrit.wikimedia.org/r/1142575 (https://phabricator.wikimedia.org/T391345) [11:46:44] (03PS2) 10Slyngshede: Login: fix redirect on login [software/bitu] - 10https://gerrit.wikimedia.org/r/1142575 (https://phabricator.wikimedia.org/T391345) [11:47:46] (03PS1) 10Kamila Součková: benthos/mw-accesslog-metrics: undo consumer group rename [puppet] - 10https://gerrit.wikimedia.org/r/1142576 (https://phabricator.wikimedia.org/T393366) [11:49:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:52:34] (03PS1) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [11:52:58] (03CR) 10CI reject: [V:04-1] mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [11:54:53] (03PS2) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [11:58:34] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [11:58:35] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: DiskSpace (instance analytics1071:9100) - https://phabricator.wikimedia.org/T392555#10795518 (10Gehel) p:05Triage→03High [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1200) [12:00:25] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:00:42] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10795544 (10Gehel) p:05Triage→03High [12:00:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T382778)', diff saved to https://phabricator.wikimedia.org/P75768 and previous config saved to /var/cache/conftool/dbconfig/20250506-120045-ladsgroup.json [12:00:49] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [12:01:01] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1195.eqiad.wmnet with reason: Maintenance [12:01:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T382778)', diff saved to https://phabricator.wikimedia.org/P75769 and previous config saved to /var/cache/conftool/dbconfig/20250506-120108-ladsgroup.json [12:04:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T382778)', diff saved to https://phabricator.wikimedia.org/P75770 and previous config saved to /var/cache/conftool/dbconfig/20250506-120434-ladsgroup.json [12:04:52] !log joal@deploy1003 Started deploy [analytics/refinery@43a5f61]: Regular analytics weekly train [analytics/refinery@43a5f617] [12:07:49] !log joal@deploy1003 Finished deploy [analytics/refinery@43a5f61]: Regular analytics weekly train [analytics/refinery@43a5f617] (duration: 02m 56s) [12:08:10] !log joal@deploy1003 Started deploy [analytics/refinery@43a5f61] (thin): Regular analytics weekly train THIN [analytics/refinery@43a5f617] [12:09:30] !log 
joal@deploy1003 Finished deploy [analytics/refinery@43a5f61] (thin): Regular analytics weekly train THIN [analytics/refinery@43a5f617] (duration: 01m 20s) [12:09:35] !log joal@deploy1003 Started deploy [analytics/refinery@43a5f61] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@43a5f617] [12:11:12] !log joal@deploy1003 Finished deploy [analytics/refinery@43a5f61] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@43a5f617] (duration: 01m 37s) [12:18:59] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10795631 (10Gehel) [12:19:05] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: DiskSpace (instance analytics1071:9100) - https://phabricator.wikimedia.org/T392555#10795633 (10Gehel) [12:19:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141700 (owner: 10Gergő Tisza) [12:19:32] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS7195/IPv6: Connect - EdgeUno, AS7195/IPv4: Connect - EdgeUno https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:19:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P75771 and previous config saved to /var/cache/conftool/dbconfig/20250506-121940-ladsgroup.json [12:21:36] (03PS2) 10Gergő Tisza: private: Drop $wgCentralAuthSul3SharedDomainRestrictions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136397 (https://phabricator.wikimedia.org/T390329) [12:21:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 06 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136397 (https://phabricator.wikimedia.org/T390329) (owner: 10Gergő Tisza) [12:23:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:23:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:24:27] (03PS2) 10Gergő Tisza: Revert "Add .well-known/matrix for wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623147 (https://phabricator.wikimedia.org/T223835) [12:25:31] (03CR) 10Gergő Tisza: "Looks like this fell through the cracks back then." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623147 (https://phabricator.wikimedia.org/T223835) (owner: 10Gergő Tisza) [12:26:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623147 (https://phabricator.wikimedia.org/T223835) (owner: 10Gergő Tisza) [12:27:20] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker [12:28:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:29:06] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1142576 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková) [12:34:48] !log ladsgroup@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P75772 and previous config saved to /var/cache/conftool/dbconfig/20250506-123448-ladsgroup.json [12:36:25] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10795687 (10tappof) I merged the patch for {T387866}, the PDUs located in drmrs are now split by rack. [12:36:36] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:39:36] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:42:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:47:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:49:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T382778)', diff saved to https://phabricator.wikimedia.org/P75773 and previous config saved to /var/cache/conftool/dbconfig/20250506-124954-ladsgroup.json [12:49:57] T382778: Optimize text table - 
https://phabricator.wikimedia.org/T382778 [12:50:11] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1196.eqiad.wmnet with reason: Maintenance [12:50:28] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:50:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T382778)', diff saved to https://phabricator.wikimedia.org/P75774 and previous config saved to /var/cache/conftool/dbconfig/20250506-125034-ladsgroup.json [12:53:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T382778)', diff saved to https://phabricator.wikimedia.org/P75775 and previous config saved to /var/cache/conftool/dbconfig/20250506-125358-ladsgroup.json [12:54:05] (03PS1) 10Arnaudb: gerrit: drop abuser [puppet] - 10https://gerrit.wikimedia.org/r/1142588 [12:54:53] (03CR) 10Ayounsi: [C:03+2] Account for non defined dict keys [homer/public] - 10https://gerrit.wikimedia.org/r/1122138 (owner: 10Ayounsi) [12:55:28] (03Merged) 10jenkins-bot: Account for non defined dict keys [homer/public] - 10https://gerrit.wikimedia.org/r/1122138 (owner: 10Ayounsi) [12:59:06] PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:59:38] (03CR) 10Ayounsi: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/1122138 (owner: 10Ayounsi) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1300). 
[13:00:05] tgr, DreamRimmer, Cyndywikime, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] o/ [13:00:15] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1142588 (owner: 10Arnaudb) [13:00:21] (03CR) 10Ayounsi: [C:03+2] gNMIc: collect optics status on Juniper [puppet] - 10https://gerrit.wikimedia.org/r/1140688 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:00:23] o/ [13:00:35] I have a meeting in 30 minutes from now, so I might not be able to deploy my config changes. no big deal, they’re optional [13:00:41] o/ [13:00:44] (03CR) 10Arnaudb: [C:03+2] gerrit: drop abuser [puppet] - 10https://gerrit.wikimedia.org/r/1142588 (owner: 10Arnaudb) [13:00:45] I can deploy [13:00:46] tgr_: do you want to start by self-servicing yours? [13:00:47] (03PS2) 10Elukey: raid: update facter and get-raid-status to allow storcli [puppet] - 10https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) [13:00:52] seeing as how half the changes are mine [13:00:58] 👍 [13:01:03] Lucas_WMDE: do your patches need testing? [13:01:09] don’t think so [13:01:20] !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-staging-worker [13:01:27] at most, “nothing broke” testing, but even then I doubt the file I touched even gets run in production [13:02:44] (03CR) 10Kamila Součková: [C:03+2] benthos/mw-accesslog-metrics: undo consumer group rename [puppet] - 10https://gerrit.wikimedia.org/r/1142576 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková) [13:02:47] it's used for creating patches, right? 
[13:02:52] yeah [13:03:37] the changes are trivial, in any case [13:03:40] PROBLEM - Host ml-staging2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:04:43] o/ [13:04:47] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:05:08] RECOVERY - Host ml-staging2003 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [13:05:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141700 (owner: 10Gergő Tisza) [13:05:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623147 (https://phabricator.wikimedia.org/T223835) (owner: 10Gergő Tisza) [13:05:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141034 (https://phabricator.wikimedia.org/T393167) (owner: 10Novem Linguae) [13:05:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [13:05:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139487 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [13:05:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139488 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [13:06:24] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:06:26] (03Merged) 10jenkins-bot: CommonSettings: Document wmfGetPrivilegedGroups usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141700 (owner: 10Gergő Tisza) [13:06:28] (03Merged) 10jenkins-bot: Revert "Add .well-known/matrix for wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623147 (https://phabricator.wikimedia.org/T223835) (owner: 10Gergő Tisza) [13:06:30] (03Merged) 10jenkins-bot: core-Permissions: add move-subpages to enwiki templateeditor user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141034 (https://phabricator.wikimedia.org/T393167) (owner: 10Novem Linguae) [13:06:33] (03Merged) 10jenkins-bot: Growth-Beta: Configure higher Impact Module edit limits for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [13:06:35] (03Merged) 10jenkins-bot: manage-dblist: Fix indentation and stray blank line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139487 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [13:06:37] (03Merged) 10jenkins-bot: manage-dblist: Fix some random phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139488 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [13:06:57] (03CR) 10Elukey: [C:03+1] "Very sad that we have to do it but LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1142557 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans) [13:07:22] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1141700|CommonSettings: Document wmfGetPrivilegedGroups usage]], [[gerrit:623147|Revert "Add .well-known/matrix for wikimedia.org" (T223835 T261531)]], [[gerrit:1141034|core-Permissions: add move-subpages to enwiki templateeditor 
user group (T393167)]], [[gerrit:1136986|Growth-Beta: Configure higher Impact Module edit limits for pilot wikis (T341599)]], [[ [13:07:22] gerrit:1139487|manage-dblist: Fix indentation and stray blank line (T392819)]], [[gerrit:1139488|manage-dblist: Fix some random phpcs violations (T392819)]] [13:07:29] T223835: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 [13:07:29] T261531: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 [13:07:29] T393167: [enwiki] Grant move-subpages to template editor user group - https://phabricator.wikimedia.org/T393167 [13:07:30] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [13:07:30] T392819: phpcs does not check manage-dblist in operations/mediawiki-config.git - https://phabricator.wikimedia.org/T392819 [13:08:21] (03PS1) 10Joal: Add termination_state field to turnilo webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1142593 (https://phabricator.wikimedia.org/T387454) [13:08:31] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366#10795879 (10Andrew) 05Open→03Resolved [13:08:49] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw [13:08:52] btullis or brouberol, could one of you review, merge and deploy --^ please? 
[13:09:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P75776 and previous config saved to /var/cache/conftool/dbconfig/20250506-130905-ladsgroup.json [13:14:50] !log tgr@deploy1003 tgr, novemlinguae, cyndywikime, lucaswerkmeister-wmde: Backport for [[gerrit:1141700|CommonSettings: Document wmfGetPrivilegedGroups usage]], [[gerrit:623147|Revert "Add .well-known/matrix for wikimedia.org" (T223835 T261531)]], [[gerrit:1141034|core-Permissions: add move-subpages to enwiki templateeditor user group (T393167)]], [[gerrit:1136986|Growth-Beta: Configure higher Impact Module edit limits f [13:14:50] or pilot wikis (T341599)]], [[gerrit:1139487|manage-dblist: Fix indentation and stray blank line (T392819)]], [[gerrit:1139488|manage-dblist: Fix some random phpcs violations (T392819)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:14:55] T223835: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 [13:14:55] T261531: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 [13:14:55] T393167: [enwiki] Grant move-subpages to template editor user group - https://phabricator.wikimedia.org/T393167 [13:14:56] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [13:14:56] T392819: phpcs does not check manage-dblist in operations/mediawiki-config.git - https://phabricator.wikimedia.org/T392819 [13:15:14] site isn’t completely broken, that’s as much testing as I can do ^^ [13:16:18] enwik one looks good [13:16:31] !log tgr@deploy1003 tgr, novemlinguae, cyndywikime, lucaswerkmeister-wmde: Continuing with sync [13:16:50] "site is completely broken" issues would get caught by the canary checks anyway [13:18:00] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL 
- AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:00] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 38, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:04] yeah [13:19:07] (03CR) 10Bking: [C:03+2] envoy: Add service proxys for cirrussearch read traffic [puppet] - 10https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [13:20:57] (03PS8) 10Volans: netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [13:21:06] RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:21:31] (03CR) 10Volans: "Fixed typo, added tests, should be ready for prime time testing and if that goes well merging." [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [13:23:40] (03PS3) 10Elukey: modules: comment out gatewayHosts->domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) [13:24:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P75777 and previous config saved to /var/cache/conftool/dbconfig/20250506-132413-ladsgroup.json [13:25:12] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10795947 (10Papaul) @Marostegui do you have a problem with us relocating this server to another rack in the same row? Thank you. 
[13:25:32] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141700|CommonSettings: Document wmfGetPrivilegedGroups usage]], [[gerrit:623147|Revert "Add .well-known/matrix for wikimedia.org" (T223835 T261531)]], [[gerrit:1141034|core-Permissions: add move-subpages to enwiki templateeditor user group (T393167)]], [[gerrit:1136986|Growth-Beta: Configure higher Impact Module edit limits for pilot wikis (T341599)]], [ [13:25:32] [gerrit:1139487|manage-dblist: Fix indentation and stray blank line (T392819)]], [[gerrit:1139488|manage-dblist: Fix some random phpcs violations (T392819)]] (duration: 18m 10s) [13:25:38] T223835: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 [13:25:38] T261531: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 [13:25:39] T393167: [enwiki] Grant move-subpages to template editor user group - https://phabricator.wikimedia.org/T393167 [13:25:39] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [13:25:39] T392819: phpcs does not check manage-dblist in operations/mediawiki-config.git - https://phabricator.wikimedia.org/T392819 [13:26:29] (03PS1) 10Elukey: profile::pyrra::filesystem::slo: enable alerts for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1142596 (https://phabricator.wikimedia.org/T391852) [13:27:42] joal on it [13:27:50] thanks brouberol [13:27:53] (03CR) 10Brouberol: [C:03+1] Add termination_state field to turnilo webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1142593 (https://phabricator.wikimedia.org/T387454) (owner: 10Joal) [13:27:54] (03CR) 10Brouberol: [C:03+2] Add termination_state field to turnilo webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1142593 (https://phabricator.wikimedia.org/T387454) (owner: 10Joal) [13:28:10] done [13:28:15] np! 
[13:30:14] PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:30:31] brouberol: have you restarted turnilo? [13:31:00] I'm letting puppet run on the nodes first. It does not restart by itself? [13:31:14] (03PS1) 10Majavah: dynamicproxy: Declare functions as local [puppet] - 10https://gerrit.wikimedia.org/r/1142598 (https://phabricator.wikimedia.org/T393024) [13:31:47] (03CR) 10Bking: [C:03+1] wdqs: fix data loaded flag for wdqs-all option [cookbooks] - 10https://gerrit.wikimedia.org/r/1142070 (owner: 10Ryan Kemper) [13:32:08] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:32:11] brouberol: I don't think it does [13:32:23] ok. 
Let me do the thing then [13:33:24] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:34:29] joal: done [13:34:55] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136397 (https://phabricator.wikimedia.org/T390329) (owner: Gergő Tisza) [13:35:02] (CR) Majavah: [C:+2] dynamicproxy: Declare functions as local [puppet] - https://gerrit.wikimedia.org/r/1142598 (https://phabricator.wikimedia.org/T393024) (owner: Majavah) [13:35:13] (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1142598 (https://phabricator.wikimedia.org/T393024) (owner: Majavah) [13:35:34] (CR) Alexandros Kosiaris: [C:+2] modules: comment out gatewayHosts->domains [deployment-charts] - https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) (owner: Elukey) [13:35:49] thanks lot brouberol [13:36:12] (Merged) jenkins-bot: private: Drop $wgCentralAuthSul3SharedDomainRestrictions [mediawiki-config] - https://gerrit.wikimedia.org/r/1136397 (https://phabricator.wikimedia.org/T390329) (owner: Gergő Tisza) [13:36:14] np!
[13:36:14] RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:36:36] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1136397|private: Drop $wgCentralAuthSul3SharedDomainRestrictions (T390329)]] [13:36:38] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:36:39] T390329: SharedDomainHookHandler::DISALLOWED_LOCAL_PROVIDERS is hard to maintain - https://phabricator.wikimedia.org/T390329 [13:36:41] FIRING: [4x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:36:57] (03Merged) 10jenkins-bot: modules: comment out gatewayHosts->domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [13:39:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T382778)', diff saved to https://phabricator.wikimedia.org/P75778 and previous config saved to /var/cache/conftool/dbconfig/20250506-133920-ladsgroup.json [13:39:23] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [13:39:36] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1206.eqiad.wmnet with reason: Maintenance [13:39:38] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:39:43] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T382778)', diff saved to https://phabricator.wikimedia.org/P75779 and previous config saved to /var/cache/conftool/dbconfig/20250506-133943-ladsgroup.json [13:41:25] 06SRE, 10Observability-Metrics, 13Patch-For-Review: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10796071 (10elukey) @herron @RLazarus There are a couple of logistical things to discuss: - https://gerrit.wikimedia.org/r/1142596 is su... [13:41:41] FIRING: [56x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:42:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T382778)', diff saved to https://phabricator.wikimedia.org/P75780 and previous config saved to /var/cache/conftool/dbconfig/20250506-134207-ladsgroup.json [13:42:23] joal: thanks for the work on webrequest termination_state in turnilo! 
[13:42:30] FIRING: Emergency syslog message: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [13:43:10] !log tgr@deploy1003 tgr: Backport for [[gerrit:1136397|private: Drop $wgCentralAuthSul3SharedDomainRestrictions (T390329)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:43:13] T390329: SharedDomainHookHandler::DISALLOWED_LOCAL_PROVIDERS is hard to maintain - https://phabricator.wikimedia.org/T390329 [13:44:21] (03PS1) 10Ayounsi: Revert "Account for non defined dict keys" [homer/public] - 10https://gerrit.wikimedia.org/r/1142599 [13:44:45] !log tgr@deploy1003 tgr: Continuing with sync [13:44:56] (03PS9) 10Ayounsi: netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 [13:45:52] (03CR) 10Ayounsi: "The proper fix is in:" [homer/public] - 10https://gerrit.wikimedia.org/r/1142599 (owner: 10Ayounsi) [13:46:18] (03CR) 10Ayounsi: [C:03+2] Revert "Account for non defined dict keys" [homer/public] - 10https://gerrit.wikimedia.org/r/1142599 (owner: 10Ayounsi) [13:46:28] PROBLEM - BGP status on lsw1-c5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:46:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:46:47] (03Merged) 10jenkins-bot: Revert "Account for non defined dict keys" [homer/public] - 10https://gerrit.wikimedia.org/r/1142599 (owner: 10Ayounsi) [13:47:28] RECOVERY - BGP status on lsw1-c5-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:47:30] RESOLVED: Emergency syslog message: Device asw1-b12-drmrs.mgmt.drmrs.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [13:49:50] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10796103 (Marostegui) Go for it! [13:51:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:08] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136397|private: Drop $wgCentralAuthSul3SharedDomainRestrictions (T390329)]] (duration: 16m 32s) [13:53:11] T390329: SharedDomainHookHandler::DISALLOWED_LOCAL_PROVIDERS is hard to maintain - https://phabricator.wikimedia.org/T390329 [13:54:08] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:55:24] SRE, SRE-Access-Requests, Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10796132 (mpopov) @BTullis: Andreas already has >>! In T393409#10793743, @Ahoelzl wrote: > Approved.
[13:55:25] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:27] I will run over the window a little [13:55:48] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-druid1003 - https://phabricator.wikimedia.org/T393229#10796135 (10Jclark-ctr) @BTullis would you be able to assist with this if i am able to swap drive? [13:56:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: 10Gergő Tisza) [13:57:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P75781 and previous config saved to /var/cache/conftool/dbconfig/20250506-135713-ladsgroup.json [13:59:16] o/ meeting done, in case you need me to take over deploying [13:59:17] FIRING: [2x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1017:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:21] (03Merged) 10jenkins-bot: logging: Add context processor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: 10Gergő Tisza) [13:59:23] PROBLEM - Hadoop NodeManager on an-worker1205 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:59:36] (03PS1) 10AOkoth: aux: add namespace for os-reports 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1142606 (https://phabricator.wikimedia.org/T350794) [13:59:42] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1136132|logging: Add context processor (T142313)]] [13:59:45] T142313: Add global information to debug logger context - https://phabricator.wikimedia.org/T142313 [14:01:41] FIRING: [56x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:02:45] PROBLEM - BGP status on lsw1-d6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:03:45] RECOVERY - BGP status on lsw1-d6-codfw.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:04:09] (03PS8) 10Cyndywikime: Growth: Remove unused PHP config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) [14:04:29] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142607 [14:04:37] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142606 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:04:38] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10796183 (10JVanderhoop-WMF) [14:04:59] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10796184 (10BTullis) [14:05:13] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1017 is CRITICAL: connect to 
address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:05:21] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10796185 (10JVanderhoop-WMF) @BTullis done! >>! In T393409#10795378, @BTullis wrote: > Hi @JVanderhoop-WMF - Please could you do the following? > *... [14:06:17] !log tgr@deploy1003 tgr: Backport for [[gerrit:1136132|logging: Add context processor (T142313)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:06:20] T142313: Add global information to debug logger context - https://phabricator.wikimedia.org/T142313 [14:06:23] RECOVERY - Hadoop NodeManager on an-worker1205 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:06:41] RESOLVED: [56x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:07:09] (03CR) 10Andrew Bogott: [C:03+2] keystone: update policy.yaml files [puppet] - 10https://gerrit.wikimedia.org/r/1141977 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [14:09:31] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:12:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P75782 and previous config saved to /var/cache/conftool/dbconfig/20250506-141220-ladsgroup.json 
[14:12:42] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate purgeExpiredBlocks to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [14:13:27] PROBLEM - BGP status on lsw1-a4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:34] !log tgr@deploy1003 tgr: Continuing with sync [14:13:47] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:14:27] RECOVERY - BGP status on lsw1-a4-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:15:31] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on wdqs1017.eqiad.wmnet with reason: bringing host online after reimage [14:18:53] (03PS2) 10Elukey: profile::pyrra::filesystem::slo: enable alerts for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1142596 (https://phabricator.wikimedia.org/T391852) [14:18:53] (03PS1) 10Elukey: profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) [14:19:29] (03CR) 10CI reject: [V:04-1] profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [14:20:18] (03PS2) 10Elukey: profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) [14:20:20] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136132|logging: Add context 
processor (T142313)]] (duration: 20m 37s) [14:20:23] T142313: Add global information to debug logger context - https://phabricator.wikimedia.org/T142313 [14:23:18] !log UTC afternoon deploys done [14:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:01] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:24:03] (03PS1) 10Giuseppe Lavagetto: Release fixes to hiddenparma and a requestctl hotfix [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1142614 [14:24:13] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Release fixes to hiddenparma and a requestctl hotfix [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1142614 (owner: 10Giuseppe Lavagetto) [14:24:50] (03PS1) 10Ayounsi: gNMIc: optics, set stream-mode/interval/encoding [puppet] - 10https://gerrit.wikimedia.org/r/1142615 [14:24:59] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate purge_parsercache_pc1 to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan) [14:25:01] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 36, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:25:05] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1002" [14:25:08] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - 
oblivian@cumin1002 [14:25:43] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1002 [14:25:44] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1002" [14:27:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:27:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T382778)', diff saved to https://phabricator.wikimedia.org/P75783 and previous config saved to /var/cache/conftool/dbconfig/20250506-142726-ladsgroup.json [14:27:30] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [14:27:42] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1207.eqiad.wmnet with reason: Maintenance [14:27:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T382778)', diff saved to https://phabricator.wikimedia.org/P75784 and previous config saved to /var/cache/conftool/dbconfig/20250506-142748-ladsgroup.json [14:28:27] (03CR) 10Cathal Mooney: [C:03+1] gNMIc: optics, set stream-mode/interval/encoding [puppet] - 10https://gerrit.wikimedia.org/r/1142615 (owner: 10Ayounsi) [14:28:41] (03CR) 10Scott French: [C:03+2] alertmanager: add receiver and routing for moderator-tools tasks [puppet] - 10https://gerrit.wikimedia.org/r/1141945 (https://phabricator.wikimedia.org/T393395) (owner: 10Scott French) [14:29:51] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1127:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:31:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T382778)', diff saved to https://phabricator.wikimedia.org/P75785 and previous config saved to /var/cache/conftool/dbconfig/20250506-143108-ladsgroup.json [14:32:16] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [14:33:25] (03CR) 10Ayounsi: [C:03+2] gNMIc: optics, set stream-mode/interval/encoding [puppet] - 10https://gerrit.wikimedia.org/r/1142615 (owner: 10Ayounsi) [14:34:11] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:34:26] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:34:39] PROBLEM - BGP status on lsw1-c3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:35:20] (03PS6) 10RLazarus: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [14:35:39] RECOVERY - BGP status on lsw1-c3-codfw.mgmt is OK: BGP OK - up: 38, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:35:57] !log jnuche@deploy1003 Installing scap version "4.161.0" for 2 host(s) [14:35:59] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1111 to cirrussearch1111 [14:36:40] !log andrew@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating IPs for cloudrabbit200[123]-dev - andrew@cumin1002" [14:37:11] !log jnuche@deploy1003 Installation of scap version "4.161.0" completed for 2 hosts [14:37:29] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:37:41] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating IPs for cloudrabbit200[123]-dev - andrew@cumin1002" [14:37:41] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:37:44] FIRING: [2x] ProbeDown: Service aux-k8s-ctrl2003:6443 has failed probes (http_aux_k8s_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:01] !incidents [14:38:01] 6094 (UNACKED) [2x] ProbeDown sre (aux-k8s-ctrl2003:6443 probes/custom codfw) [14:38:08] !ack 6094 [14:38:09] 6094 (ACKED) [2x] ProbeDown sre (aux-k8s-ctrl2003:6443 probes/custom codfw) [14:38:39] looking [14:40:27] ah snap this may be again the cfssl cert rotation [14:40:41] looks like the api server et al. 
were restarted, yeah [14:41:15] the issue is less frequent now that the VMs are bigger, but I am wondering if we should increase the probe time [14:41:27] (03PS3) 10Andrew Bogott: Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) [14:41:29] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [14:41:39] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1111 to cirrussearch1111 - bking@cumin2002" [14:41:45] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1111 to cirrussearch1111 - bking@cumin2002" [14:41:45] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:41:46] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1111 on all recursors [14:41:49] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1111 on all recursors [14:41:49] swfrench-wmf: elukey: thanks [14:41:50] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1111 [14:42:10] confirmed from the puppet-agent timer journal that this was indeed a cert rotation [14:42:44] RESOLVED: [2x] ProbeDown: Service aux-k8s-ctrl2003:6443 has failed probes (http_aux_k8s_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:54] elukey: in general I think increasing k8s apiserver probe times and/or allowed failures is a good idea [14:43:05] !log bking@cumin2002 END 
(PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1111 [14:43:23] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1156.eqiad.wmnet with reason: Harddrive replacement [14:43:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10796339 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3e41af22-1a28-4204-b34a-714e5f674f69) set b... [14:43:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1111 to cirrussearch1111 [14:44:25] also apiserver reload as opposed to restart is on the cards (e.g. a k8s upgrade away) or that's it ? [14:44:25] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1177.eqiad.wmnet with reason: Harddrive replacement [14:44:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10796347 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=99bfb834-e21a-4c4a-8267-639998e973bc) set b... [14:44:50] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1111.eqiad.wmnet with OS bullseye [14:44:56] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10796357 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cirrussearch1111.eqiad.w... [14:44:58] +1 to longer probe timeouts, particularly for single-host probes (e.g., vs. 
a probe at the service level) [14:45:16] so IIUC the default for prometheus::blackbox::check::http is 3s, that is really tight [14:45:39] godog: not sure if it can reload properly, at least in this version :( [14:45:59] also alert_after is 2m, tight as well [14:46:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P75786 and previous config saved to /var/cache/conftool/dbconfig/20250506-144615-ladsgroup.json [14:46:23] (03PS4) 10Herron: logs-api: add write/delete acl via htgroup [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) [14:46:26] sending a patch [14:46:31] ack, SGTM [14:47:01] PROBLEM - BGP status on lsw1-d5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:47:43] (03PS1) 10Elukey: profile::kubernetes::master: be more lenient for kube-api probes [puppet] - 10https://gerrit.wikimedia.org/r/1142618 [14:48:01] RECOVERY - BGP status on lsw1-d5-codfw.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:48:07] (03PS4) 10Andrew Bogott: Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) [14:48:32] (03CR) 10Elukey: "Lemme know if you want to tune values!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1142618 (owner: 10Elukey) [14:48:36] sent :) [14:51:38] RECOVERY - Host db1246 #page is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:51:41] PROBLEM - SSH on db1246 is CRITICAL: connect to address 10.64.48.172 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:52:23] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1142618 (owner: 10Elukey) [14:52:38] (03CR) 10CDanis: [C:03+1] profile::kubernetes::master: be more lenient for kube-api probes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142618 (owner: 10Elukey) [14:53:04] (03PS3) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [14:53:27] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [14:53:44] (03PS5) 10Herron: logs-api: add write/delete acl via htgroup [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) [14:54:27] (03PS2) 10Elukey: profile::kubernetes::master: be more lenient for kube-api probes [puppet] - 10https://gerrit.wikimedia.org/r/1142618 [14:54:31] (03CR) 10Elukey: profile::kubernetes::master: be more lenient for kube-api probes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142618 (owner: 10Elukey) [14:55:12] (03PS3) 10Elukey: profile::kubernetes::master: be more lenient for kube-api probes [puppet] - 10https://gerrit.wikimedia.org/r/1142618 [14:55:36] (03CR) 10Elukey: profile::kubernetes::master: be more lenient for kube-api probes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142618 (owner: 10Elukey) [14:56:00] (03CR) 10Filippo Giunchedi: [C:03+1] profile::kubernetes::master: be more lenient for kube-api probes [puppet] - 
https://gerrit.wikimedia.org/r/1142618 (owner: Elukey) [14:56:14] (CR) Scott French: [C:+1] "It looks like we don't have service-level probes for API servers (i.e., that would warrant a tighter threshold vs. host-level probes), but" [puppet] - https://gerrit.wikimedia.org/r/1142618 (owner: Elukey) [14:58:07] (CR) Elukey: [C:+2] profile::kubernetes::master: be more lenient for kube-api probes [puppet] - https://gerrit.wikimedia.org/r/1142618 (owner: Elukey) [14:58:14] thanks for the review folks! [14:58:24] (PS6) Herron: logs-api: add write/delete acl via htgroup [puppet] - https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) [14:58:33] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1111.eqiad.wmnet with reason: host reimage [14:58:59] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1127:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:00:05] jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1500).
[15:01:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P75787 and previous config saved to /var/cache/conftool/dbconfig/20250506-150122-ladsgroup.json [15:01:26] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate one image suggestions job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140671 (https://phabricator.wikimedia.org/T388537) (owner: 10Hnowlan) [15:02:08] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1111.eqiad.wmnet with reason: host reimage [15:02:26] (03PS7) 10Herron: logs-api: add write/delete acl via htgroup [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) [15:04:17] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate all image suggestions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140672 (https://phabricator.wikimedia.org/T388537) (owner: 10Hnowlan) [15:04:26] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate one image suggestions job to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140671 (https://phabricator.wikimedia.org/T388537) (owner: 10Hnowlan) [15:04:33] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [15:04:46] (03CR) 10Andrew Bogott: [C:03+2] nova policy.yaml: update with advice from oslopolicy-validator [puppet] - 10https://gerrit.wikimedia.org/r/1141978 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [15:04:48] (03CR) 10Andrew Bogott: [C:03+2] nova policy.json: remove a bunch of redundant rules [puppet] - 10https://gerrit.wikimedia.org/r/1141979 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [15:04:51] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate purgeExpiredBlocks to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140482 
(https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [15:06:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10796444 (10VRiley-WMF) Sure, I will look into this and commence with the relocation [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:48] (03CR) 10Scott French: mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:08:25] (03PS2) 10Filippo Giunchedi: sre: alert on Prometheus codfw/eqiad down [alerts] - 10https://gerrit.wikimedia.org/r/1142543 (https://phabricator.wikimedia.org/T393365) [15:08:25] (03PS1) 10Filippo Giunchedi: sre: alert on webrequest-sampled not processed [alerts] - 10https://gerrit.wikimedia.org/r/1142621 (https://phabricator.wikimedia.org/T393365) [15:08:49] (03PS2) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) [15:08:57] (03CR) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:09:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:33] (03PS8) 10Herron: logs-api: add write/delete acl via htgroup [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) [15:11:09] 
(03CR) 10Andrew Bogott: [C:03+2] glance: update policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1141980 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [15:11:12] (03CR) 10Andrew Bogott: [C:03+2] Cinder: explicitly use new policy rules [puppet] - 10https://gerrit.wikimedia.org/r/1141981 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [15:11:14] (03CR) 10Andrew Bogott: [C:03+2] cinder policy.yaml: update, remove redundant rules [puppet] - 10https://gerrit.wikimedia.org/r/1141982 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [15:11:16] (03CR) 10Andrew Bogott: [C:03+2] Neutron: update policy rules [puppet] - 10https://gerrit.wikimedia.org/r/1141983 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [15:11:19] (03CR) 10Andrew Bogott: [C:03+2] Designate: update policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1141984 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [15:11:26] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:11:41] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:12:30] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate readinglists job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142568 (https://phabricator.wikimedia.org/T388541) (owner: 10Hnowlan) [15:14:28] PROBLEM - BGP status on lsw1-c2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:14:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:28] RECOVERY - BGP status on lsw1-c2-codfw.mgmt 
is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:16:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T382778)', diff saved to https://phabricator.wikimedia.org/P75788 and previous config saved to /var/cache/conftool/dbconfig/20250506-151629-ladsgroup.json [15:16:35] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [15:16:45] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1218.eqiad.wmnet with reason: Maintenance [15:16:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T382778)', diff saved to https://phabricator.wikimedia.org/P75789 and previous config saved to /var/cache/conftool/dbconfig/20250506-151652-ladsgroup.json [15:17:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1111.eqiad.wmnet with OS bullseye [15:17:08] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10796517 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cirrussearch1111.eqiad.wmnet... 
[15:18:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1127:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:19:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T382778)', diff saved to https://phabricator.wikimedia.org/P75790 and previous config saved to /var/cache/conftool/dbconfig/20250506-151946-ladsgroup.json [15:21:08] (03PS5) 10Andrew Bogott: Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) [15:21:19] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [15:23:53] (03CR) 10Scott French: mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:24:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [15:26:13] (03CR) 10Andrew Bogott: [C:03+2] Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [15:28:00] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw [15:28:35] jouncebot: nowandnext [15:28:35] For the next 0 hour(s) and 31 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1500) [15:28:35] In 0 hour(s) and 31 minute(s): Puppet request window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1600) [15:29:41] (03CR) 10Herron: "good point, I updated this to be its own auth type called local-api to avoid conflicts" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [15:30:10] (03CR) 10Bking: [C:03+2] "It's confirmed working. I don't have the cycles to work on the reverse DNS part but we can revisit after our OS migration if y'all still w" [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [15:31:05] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10796541 (10Stevemunene) a:03Stevemunene [15:31:43] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: DiskSpace (instance analytics1071:9100) - https://phabricator.wikimedia.org/T392555#10796543 (10Stevemunene) a:03Stevemunene [15:33:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1127:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:34:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P75792 and previous config saved to /var/cache/conftool/dbconfig/20250506-153453-ladsgroup.json [15:36:18] (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: increase buffering [puppet] - 10https://gerrit.wikimedia.org/r/1142625 [15:37:09] (03CR) 10Scott French: [C:03+2] mw::maintenance: update team for pagetriage jobs [puppet] - 10https://gerrit.wikimedia.org/r/1141946 (https://phabricator.wikimedia.org/T393395) (owner: 10Scott French) [15:38:25] PROBLEM - 
Host db1246 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:38:43] !incidents [15:38:43] 6095 (UNACKED) Host db1246 (paged) - PING - Packet loss = 100% [15:38:43] 6094 (RESOLVED) [2x] ProbeDown sre (aux-k8s-ctrl2003:6443 probes/custom codfw) [15:38:50] !ack 6095 [15:38:50] 6095 (ACKED) Host db1246 (paged) - PING - Packet loss = 100% [15:39:09] * swfrench-wmf checks to confirm it's still depooled [15:39:41] fwiw, I had marked this as resolved yesterday [15:39:45] the page from the weekend [15:40:14] still depooled [15:40:17] (03CR) 10Filippo Giunchedi: [C:03+1] benthos/mw_accesslog_metrics: increase buffering [puppet] - 10https://gerrit.wikimedia.org/r/1142625 (owner: 10Kamila Součková) [15:40:22] I think this is just a missing downtime [15:40:27] is this what the disable_notifications flag in Puppet is for? [15:40:29] I'll add one and follow up on the task [15:40:38] thanks swfrench-wmf [15:40:42] 06SRE, 10Observability-Metrics, 13Patch-For-Review: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10796611 (10herron) >>! In T391852#10796071, @elukey wrote: > @herron @RLazarus There are a couple of logistical things to discuss: > >... 
[15:40:51] cdanis: ah, perhaps that's a better option given that this host is cursed [15:41:03] I think it's a favorite of DBAs for cursed hosts [15:41:26] looks like T393296 tracks the latest iteration of this [15:41:26] T393296: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296 [15:41:29] if it does reboot randomly another option is to power it off :) [15:41:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:50] (03CR) 10Herron: [C:03+1] grafana: Toggle data sync using feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [15:42:26] (03CR) 10Herron: [C:03+1] grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [15:43:36] (03PS3) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [15:43:36] (03CR) 10Hnowlan: "I think there are enough of these types of job that we might need to entertain the idea of another migration flag (that we later remove) f" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:43:42] (03CR) 10Herron: [C:03+1] sre: alert on Prometheus codfw/eqiad down [alerts] - 10https://gerrit.wikimedia.org/r/1142543 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi) [15:43:48] (03CR) 10JHathaway: [C:03+2] Gemfile: update rspec-puppet to 2.10.x [puppet] - 10https://gerrit.wikimedia.org/r/1136403 (owner: 10Hashar) [15:44:40] (03CR) 10Hashar: "Thanks!!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1136403 (owner: 10Hashar) [15:45:17] !log swfrench@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Host has crashed - T393296 [15:45:26] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10796642 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3fbee0e3-9e65-4eb8-8e55-5d123f8b4e20) set by swfrench@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with rea... [15:45:52] (03PS2) 10Kamila Součková: benthos/mw_accesslog_metrics: increase buffering [puppet] - 10https://gerrit.wikimedia.org/r/1142625 [15:47:26] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10796665 (10Scott_French) FYI, I've silenced notifications from this host for the next week, to avoid repeated pages while work is ongoing. These will need cleared if the host is returned to service ea... 
[15:48:09] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:48:10] (03PS3) 10CDanis: NetworkProbeLimit cookie: use SameSite=None [puppet] - 10https://gerrit.wikimedia.org/r/1138836 (https://phabricator.wikimedia.org/T342624) [15:48:16] (03CR) 10CDanis: [C:03+2] NetworkProbeLimit cookie: use SameSite=None [puppet] - 10https://gerrit.wikimedia.org/r/1138836 (https://phabricator.wikimedia.org/T342624) (owner: 10CDanis) [15:48:17] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:48:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:49:29] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10796672 (10VRiley-WMF) After speaking about this with @Papaul and @Jclark-ctr I will be relocating this server to D6, I will update the ticket once it's been moved with the new location and cableID [15:49:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:50:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P75793 and previous config saved to /var/cache/conftool/dbconfig/20250506-155000-ladsgroup.json [15:51:37] (03CR) 10Bernard Wang: Stream registration for article summaries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [15:51:49] (03CR) 10CDanis: [V:03+2 C:03+2] NetworkProbeLimit cookie: use SameSite=None [puppet] - 10https://gerrit.wikimedia.org/r/1138836 (https://phabricator.wikimedia.org/T342624) (owner: 10CDanis) [15:53:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:51] RECOVERY - Host db1246 #page is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:54:47] (03CR) 10Herron: [C:03+1] "had a quick look at current state of these in the Pyrra UI, LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1142596 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [16:00:04] jhathaway and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1600). [16:00:04] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:14] (03CR) 10Herron: "SGTM! Let's set the metadata namespace in pyrra to reflect something like "pilot" since it sounds like this one is not yet in steady stat" [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [16:00:14] o/ [16:00:18] 👋 [16:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:12] (03CR) 10RLazarus: [C:03+2] SpiderPig: Require explicit hiera config to enable Spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [16:02:32] (03PS1) 10Ssingh: type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) [16:02:51] 06SRE, 06Infrastructure-Foundations, 06Traffic, 13Patch-For-Review: NetworkProbeLimit cookie 
rejected due to missing SameSite attribute - https://phabricator.wikimedia.org/T342624#10796746 (10CDanis) 05Open→03Resolved [16:03:38] (03CR) 10Herron: [C:03+1] sre: alert on webrequest-sampled not processed (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1142621 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi) [16:03:48] dancy: hm, I can't ssh to prod for some reason -- one sec while I debug [16:03:59] ok [16:04:15] happy to merge and run puppet on a deploy host though, and I assume as long as puppet doesn't fail there's nothing for you to test [16:04:29] (03CR) 10Ssingh: "dig supports HTTPS records, so to test this, you will need to put them in a local gdnsd zone file and then query that using dig. I can pas" [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh) [16:04:38] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [16:04:40] rzl: That's right. [16:04:58] rzl: I can verify that the spiderpig services are still running [16:05:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T382778)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250506-160507-ladsgroup.json [16:05:20] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [16:05:28] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1219.eqiad.wmnet with reason: Maintenance [16:05:34] dancy: nod [16:05:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T382778)', diff saved to https://phabricator.wikimedia.org/P75795 and previous config saved to /var/cache/conftool/dbconfig/20250506-160535-ladsgroup.json [16:05:46] (03CR) 10Ssingh: "$ utils/type65.py -p 1 -t foo.example.org. 
--params 'alpn=h2,h3-19 ipv4hint=192.168.0.1'" [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh) [16:07:38] (03PS2) 10Ssingh: type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) [16:08:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:38] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1142543 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi) [16:08:45] dancy: okay I'm in business, thanks for your patience [16:08:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T382778)', diff saved to https://phabricator.wikimedia.org/P75796 and previous config saved to /var/cache/conftool/dbconfig/20250506-160854-ladsgroup.json [16:09:15] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1142621 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi) [16:09:22] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10796775 (10BTullis) [16:10:10] (03CR) 10Andrea Denisse: [C:03+2] grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [16:10:37] (03CR) 10FNegri: P:toolforge: Apply admin-root sudo policy to all instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797) (owner: 10Majavah) [16:15:43] dancy: puppet's done on deploy1003 [16:16:34] rzl: Everything looks good. Thanks Reuven. [16:18:24] (03CR) 10CI reject: [V:04-1] type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh) [16:20:00] (03Abandoned) 10Clare Ming: Experimentation Lab: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136032 (owner: 10Clare Ming) [16:20:55] (03PS3) 10Ssingh: type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) [16:20:59] (03PS1) 10Btullis: Add jvanderhoop to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1142632 (https://phabricator.wikimedia.org/T393409) [16:22:21] (03CR) 10CI reject: [V:04-1] Add jvanderhoop to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1142632 (https://phabricator.wikimedia.org/T393409) (owner: 10Btullis) [16:23:37] (03PS2) 10Btullis: Add jvanderhoop to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1142632 
(https://phabricator.wikimedia.org/T393409) [16:24:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P75797 and previous config saved to /var/cache/conftool/dbconfig/20250506-162401-ladsgroup.json [16:26:57] (03PS4) 10Ssingh: type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) [16:28:02] (03CR) 10Andrea Denisse: [C:03+2] grafana: Toggle data sync using feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [16:28:31] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10796862 (10VRiley-WMF) New location for this server is D6 U22 CableID 5166 port 26 [16:33:33] !log cdanis@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - cdanis@cumin1002" [16:33:34] !log cdanis@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - cdanis@cumin1002 [16:34:01] !log cdanis@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - cdanis@cumin1002 [16:34:02] !log cdanis@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - cdanis@cumin1002" [16:34:15] !log enable Puppet on Grafana2001 - T384841 [16:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:17] T384841: Upgrade to Grafana 11 - https://phabricator.wikimedia.org/T384841 [16:34:41] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1141548 (owner: 10Andrew Bogott) [16:35:48] (03PS1) 10Eevans: WIP: JBOD partman recipe for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/1142635 (https://phabricator.wikimedia.org/T391544) [16:36:05] (03CR) 10Andrew Bogott: [C:03+2] ceph_disks.rb: don't try to strip an int [puppet] - 10https://gerrit.wikimedia.org/r/1141548 (owner: 10Andrew Bogott) [16:36:50] (03CR) 10Alexandros Kosiaris: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1139893 (owner: 10JHathaway) [16:39:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P75798 and previous config saved to /var/cache/conftool/dbconfig/20250506-163908-ladsgroup.json [16:53:07] (03CR) 10Ayounsi: [C:03+1] "Awesome, thanks a lot ! I gave it a try with and without the related homer-deploy CR and it works as expected." [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [16:54:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T382778)', diff saved to https://phabricator.wikimedia.org/P75799 and previous config saved to /var/cache/conftool/dbconfig/20250506-165415-ladsgroup.json [16:54:18] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [16:54:30] (03CR) 10CDanis: [C:03+1] Fastnetmon bump threshold_mbps to 8Gbps [puppet] - 10https://gerrit.wikimedia.org/r/1140689 (https://phabricator.wikimedia.org/T311005) (owner: 10Ayounsi) [16:54:31] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1232.eqiad.wmnet with reason: Maintenance [16:54:33] (03PS1) 10Bking: cirrussearch: move net-new hosts into prod role [puppet] - 10https://gerrit.wikimedia.org/r/1142636 (https://phabricator.wikimedia.org/T391118) [16:54:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T382778)', diff saved to https://phabricator.wikimedia.org/P75800 
and previous config saved to /var/cache/conftool/dbconfig/20250506-165438-ladsgroup.json [16:55:27] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142636 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [16:55:37] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.wikimedia.org with OS bookworm [16:55:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10797012 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm [16:57:08] (03CR) 10Ayounsi: [C:03+2] Fastnetmon bump threshold_mbps to 8Gbps [puppet] - 10https://gerrit.wikimedia.org/r/1140689 (https://phabricator.wikimedia.org/T311005) (owner: 10Ayounsi) [16:57:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T382778)', diff saved to https://phabricator.wikimedia.org/P75801 and previous config saved to /var/cache/conftool/dbconfig/20250506-165752-ladsgroup.json [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1700) [17:03:07] (03PS1) 10JHathaway: run_ci_locally.sh: fmt with shfmt [puppet] - 10https://gerrit.wikimedia.org/r/1142639 [17:04:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:04:44] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate readinglists job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142568 (https://phabricator.wikimedia.org/T388541) (owner: 10Hnowlan) [17:06:55] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:09:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:11:01] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/1142539 (https://phabricator.wikimedia.org/T393401) (owner: 10Ayounsi) [17:11:50] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:12:06] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:12:56] PROBLEM - SSH on netbox1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:13:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P75802 and previous config saved to /var/cache/conftool/dbconfig/20250506-171259-ladsgroup.json [17:13:47] (03PS1) 10HMonroy: Enable Codex and Multiblocks in Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121) [17:13:59] FIRING: [22x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:16:42] FIRING: JobUnavailable: Reduced availability for job netbox_global in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:20:42] (03CR) 10Bking: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142636 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [17:21:33] (03PS2) 10AOkoth: wmnet: revert active aphlict host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1140218 (https://phabricator.wikimedia.org/T392128) [17:22:16] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate purgeExpiredBlocks to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [17:22:38] (03CR) 10CI reject: [V:04-1] wmnet: revert active aphlict host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1140218 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [17:24:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:24:46] RECOVERY - SSH on netbox1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:26:55] FIRING: [22x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P75803 and previous config saved to /var/cache/conftool/dbconfig/20250506-172807-ladsgroup.json [17:28:59] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:29:28] !log ryankemper@deploy1003 
Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [17:29:40] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 11s) [17:29:50] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1017 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:30:12] jouncebot: nowandnext [17:30:12] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1700) [17:30:13] In 0 hour(s) and 29 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1800) [17:30:14] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1017 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:30:26] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:30:30] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1017 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:30:42] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:31:46] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1017.eqiad.wmnet, repooling source-only afterwards [17:31:48] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [17:31:55] FIRING: [14x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:31:56] (03CR) 10Bking: wdqs-main: bring old internal hosts into service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1141957 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [17:33:44] (03CR) 10Ryan Kemper: [C:03+2] wdqs: fix data loaded flag for wdqs-all option [cookbooks] - 10https://gerrit.wikimedia.org/r/1142070 (owner: 10Ryan Kemper) [17:34:06] is it just me or is gerrit having a bad time? [17:34:21] I get 20s+ page load times [17:34:23] tgr_ you're not alone. There's some chatter about it in #security [17:34:25] It's not just you [17:34:44] I've been saying it for like 1 hour, so at this point I expected it was known / being worked on [17:35:41] (03CR) 10Ryan Kemper: [C:03+2] wdqs-main: bring old internal hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/1141957 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [17:36:42] RESOLVED: JobUnavailable: Reduced availability for job netbox_global in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:37:07] FIRING: [14x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:38:59] FIRING: [14x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:39:24] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into 
service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2008.codfw.wmnet, repooling source-only afterwards [17:39:27] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [17:39:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:40:14] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1011.eqiad.wmnet, repooling source-only afterwards [17:40:15] (03PS2) 10MacFan4000: ExtensionDistributor: Mark 1.44 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142646 (https://phabricator.wikimedia.org/T390794) [17:40:59] (03CR) 10CI reject: [V:04-1] ExtensionDistributor: Mark 1.44 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142646 (https://phabricator.wikimedia.org/T390794) (owner: 10MacFan4000) [17:41:08] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1017.eqiad.wmnet, repooling source-only afterwards [17:41:20] (03Merged) 10jenkins-bot: wdqs: fix data loaded flag for wdqs-all option [cookbooks] - 10https://gerrit.wikimedia.org/r/1142070 (owner: 10Ryan Kemper) [17:41:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T393066#10797241 (10Milimetric) Approved as well, for the `analytics-privatedata-users` group, as per [[ https://phabricator.wikimedia.org/source/operations-puppet/browse/pr... 
[17:41:55] FIRING: [14x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:25] (03PS3) 10MacFan4000: ExtensionDistributor: Mark 1.44 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142646 (https://phabricator.wikimedia.org/T390794) [17:42:45] (03PS1) 10Tchanders: Assign IP auto-reveal rights to certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142649 (https://phabricator.wikimedia.org/T386492) [17:43:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T382778)', diff saved to https://phabricator.wikimedia.org/P75804 and previous config saved to /var/cache/conftool/dbconfig/20250506-174313-ladsgroup.json [17:43:17] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [17:43:18] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1234.eqiad.wmnet with reason: Maintenance [17:43:20] (03CR) 10Tchanders: [C:04-2] "Needs some discussion and possibly instrumentation to go first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142649 (https://phabricator.wikimedia.org/T386492) (owner: 10Tchanders) [17:43:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T382778)', diff saved to https://phabricator.wikimedia.org/P75805 and previous config saved to /var/cache/conftool/dbconfig/20250506-174325-ladsgroup.json [17:43:32] (03CR) 10CI reject: [V:04-1] Assign IP auto-reveal rights to certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142649 (https://phabricator.wikimedia.org/T386492) (owner: 10Tchanders) [17:43:34] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to for - https://phabricator.wikimedia.org/T393066#10797257 (10Ahoelzl) 
[17:43:57] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2008.codfw.wmnet, repooling source-only afterwards [17:43:59] FIRING: [14x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:41] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1011.eqiad.wmnet, repooling source-only afterwards [17:44:44] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [17:44:44] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2014.codfw.wmnet, repooling source-only afterwards [17:44:58] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [17:45:57] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Requesting access to for - https://phabricator.wikimedia.org/T393066#10797266 (10BTullis) a:03BTullis I'll pick up the puppet change for this ticket. 
[17:46:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T382778)', diff saved to https://phabricator.wikimedia.org/P75806 and previous config saved to /var/cache/conftool/dbconfig/20250506-174639-ladsgroup.json [17:46:55] FIRING: [14x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:48:30] (03PS2) 10Ryan Kemper: wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1139531 (https://phabricator.wikimedia.org/T388134) [17:48:59] FIRING: [14x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:49:04] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2014.codfw.wmnet, repooling source-only afterwards [17:49:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling source-only afterwards [17:49:38] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2015.codfw.wmnet, repooling source-only afterwards [17:49:43] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1017.eqiad.wmnet, repooling source-only afterwards [17:49:46] T388134: Drop support for the full Wikidata graph 
from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [17:51:50] (03CR) 10Tchanders: [C:04-2] Assign IP auto-reveal rights to certain groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142649 (https://phabricator.wikimedia.org/T386492) (owner: 10Tchanders) [17:54:09] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1017.eqiad.wmnet, repooling source-only afterwards [17:54:10] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2015.codfw.wmnet, repooling source-only afterwards [17:55:37] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10797343 (10Papaul) thank you @VRiley-WMF the server is up in new rack i will be doing some HW test. 
[17:56:49] (03CR) 10Bking: [C:03+1] wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1139531 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [17:56:55] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T1800) [18:01:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P75807 and previous config saved to /var/cache/conftool/dbconfig/20250506-180146-ladsgroup.json [18:02:56] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142654 (https://phabricator.wikimedia.org/T386223) [18:02:57] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142654 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot) [18:04:10] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142654 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot) [18:04:47] (03CR) 10Ssingh: "To make reviewing easier without reading the RFC, output from the script and a gdnsd zone file:" [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh) [18:09:26] (03CR) 10Ssingh: "Ready for review; feel free to go through the RFC as well (https://www.rfc-editor.org/rfc/rfc9460.pdf) or just review the gdnsd and dig ou" [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh) [18:12:09] !log 
ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1011.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [18:12:12] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [18:12:20] (03CR) 10Neriah: [C:03+1] Enable Codex and Multiblocks in Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [18:13:30] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2008.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [18:16:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P75808 and previous config saved to /var/cache/conftool/dbconfig/20250506-181652-ladsgroup.json [18:17:10] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.28 refs T386223 [18:17:12] T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223 [18:23:35] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1011.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [18:23:38] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [18:23:59] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:24:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2008.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [18:25:17] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1011.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [18:25:43] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2014.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [18:29:39] (03CR) 10JHathaway: [C:03+2] run_ci_locally.sh: fmt with shfmt [puppet] - 10https://gerrit.wikimedia.org/r/1142639 (owner: 10JHathaway) [18:31:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T382778)', diff saved to https://phabricator.wikimedia.org/P75810 and previous config saved to /var/cache/conftool/dbconfig/20250506-183159-ladsgroup.json [18:32:02] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [18:32:16] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1235.eqiad.wmnet with reason: Maintenance [18:32:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T382778)', diff saved to https://phabricator.wikimedia.org/P75811 and previous config saved to /var/cache/conftool/dbconfig/20250506-183222-ladsgroup.json [18:35:09] (03CR) 10Arlolra: [C:03+2] ExtensionDistributor: Mark 1.44 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142646 (https://phabricator.wikimedia.org/T390794) (owner: 10MacFan4000) [18:35:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1235 (T382778)', diff saved to https://phabricator.wikimedia.org/P75812 and previous config saved to /var/cache/conftool/dbconfig/20250506-183533-ladsgroup.json [18:38:30] (03Merged) 10jenkins-bot: ExtensionDistributor: Mark 1.44 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142646 (https://phabricator.wikimedia.org/T390794) (owner: 10MacFan4000) [18:46:55] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:47:04] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:47:13] FIRING: [6x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:50:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P75813 and previous config saved to /var/cache/conftool/dbconfig/20250506-185040-ladsgroup.json [19:04:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:05:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P75814 and previous config saved to 
/var/cache/conftool/dbconfig/20250506-190547-ladsgroup.json [19:09:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:16:10] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142668 [19:18:39] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2014.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [19:18:42] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [19:18:59] FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:19:04] FIRING: [10x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:17] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:20:12] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2008.codfw.wmnet w/ force delete 
existing files, repooling source-only afterwards [19:20:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T382778)', diff saved to https://phabricator.wikimedia.org/P75815 and previous config saved to /var/cache/conftool/dbconfig/20250506-192054-ladsgroup.json [19:20:58] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [19:21:02] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2014.codfw.wmnet -> wdqs2015.codfw.wmnet w/ force delete existing files, repooling neither afterwards [19:21:10] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:22:17] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1240.eqiad.wmnet with reason: Maintenance [19:23:26] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1251.eqiad.wmnet with reason: Maintenance [19:23:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1251 (T382778)', diff saved to https://phabricator.wikimedia.org/P75816 and previous config saved to /var/cache/conftool/dbconfig/20250506-192333-ladsgroup.json [19:23:35] (03CR) 10AOkoth: [C:03+2] vrts: add junk queue count and remove mobile queue [puppet] - 10https://gerrit.wikimedia.org/r/1140207 (https://phabricator.wikimedia.org/T389079) (owner: 10AOkoth) [19:23:35] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1011.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [19:23:59] FIRING: [9x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:08] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:24:47] (03CR) 10AOkoth: [C:03+2] aux: add namespace for os-reports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142606 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [19:25:18] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [19:25:20] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [19:25:35] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1017.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [19:25:59] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142668 (owner: 10PipelineBot) [19:26:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T382778)', diff saved to https://phabricator.wikimedia.org/P75817 and previous config saved to /var/cache/conftool/dbconfig/20250506-192624-ladsgroup.json [19:26:27] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [19:27:36] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142668 (owner: 10PipelineBot) [19:37:24] (03PS1) 10Jdlrobson: Clear floats to avoid tall charts [extensions/Chart] 
(wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142671 (https://phabricator.wikimedia.org/T393286) [19:38:30] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [19:38:55] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [19:40:32] (03PS1) 10CDanis: gerrit: block addl IP range [puppet] - 10https://gerrit.wikimedia.org/r/1142672 [19:41:25] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [19:41:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P75818 and previous config saved to /var/cache/conftool/dbconfig/20250506-194131-ladsgroup.json [19:42:03] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [19:42:28] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [19:42:48] (03CR) 10Scott French: [C:03+1] gerrit: block addl IP range [puppet] - 10https://gerrit.wikimedia.org/r/1142672 (owner: 10CDanis) [19:43:05] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [19:44:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142636 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [19:44:08] (03CR) 10CDanis: [C:03+2] gerrit: block addl IP range [puppet] - 10https://gerrit.wikimedia.org/r/1142672 (owner: 10CDanis) [19:46:27] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.wikimedia.org with OS bookworm [19:46:33] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10797701 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm [19:47:46] (03PS1) 10CDanis: gerrit: fix new blocked ranges [puppet] - 10https://gerrit.wikimedia.org/r/1142673 
[19:49:18] (03CR) 10Scott French: [C:03+1] gerrit: fix new blocked ranges [puppet] - 10https://gerrit.wikimedia.org/r/1142673 (owner: 10CDanis) [19:49:28] (03CR) 10CDanis: [C:03+2] gerrit: fix new blocked ranges [puppet] - 10https://gerrit.wikimedia.org/r/1142673 (owner: 10CDanis) [19:50:32] (03PS1) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1142675 [19:55:29] (03CR) 10JHathaway: "I've tested an alternate approach, which doesn't require a rootfull container, would love if you could take a look, https://gerrit.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway) [19:56:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P75819 and previous config saved to /var/cache/conftool/dbconfig/20250506-195638-ladsgroup.json [19:56:40] (03CR) 10Bking: "The PCC runs are failing because there are no facts for cirrussearch1111. I followed the PCC update process as described at https://w.wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1142636 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:59] (03CR) 10JHathaway: "@ltoscano@wikimedia.org & @hashar@free.fr here is an alternative approach that does not require a rootful container." 
[puppet] - https://gerrit.wikimedia.org/r/1142675 (owner: JHathaway) [20:04:15] (CR) Ebernhardson: [C:+1] cirrussearch: move net-new hosts into prod role [puppet] - https://gerrit.wikimedia.org/r/1142636 (https://phabricator.wikimedia.org/T391118) (owner: Bking) [20:05:24] (CR) Btullis: [C:+2] Add jvanderhoop to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1142632 (https://phabricator.wikimedia.org/T393409) (owner: Btullis) [20:07:01] (CR) Bking: [C:+2] cirrussearch: move net-new hosts into prod role [puppet] - https://gerrit.wikimedia.org/r/1142636 (https://phabricator.wikimedia.org/T391118) (owner: Bking) [20:10:01] (PS1) Btullis: Add scampos to the analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/1142679 (https://phabricator.wikimedia.org/T393066) [20:11:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T382778)', diff saved to https://phabricator.wikimedia.org/P75820 and previous config saved to /var/cache/conftool/dbconfig/20250506-201145-ladsgroup.json [20:11:49] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [20:12:02] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [20:12:50] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2014.codfw.wmnet -> wdqs2015.codfw.wmnet w/ force delete existing files, repooling neither afterwards [20:12:53] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [20:13:06] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2141.codfw.wmnet with reason: Maintenance [20:13:35] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
(T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1017.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [20:13:59] FIRING: [13x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:14:03] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, and 2 others: Requesting access to for - https://phabricator.wikimedia.org/T393066#10797802 (BTullis) [20:14:04] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:14:14] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2145.codfw.wmnet with reason: Maintenance [20:14:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T382778)', diff saved to https://phabricator.wikimedia.org/P75821 and previous config saved to /var/cache/conftool/dbconfig/20250506-201421-ladsgroup.json [20:16:55] RESOLVED: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:16:59] FIRING: [17x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:17:45] !log ladsgroup@cumin1002 dbctl
commit (dc=all): 'Repooling after maintenance db2145 (T382778)', diff saved to https://phabricator.wikimedia.org/P75822 and previous config saved to /var/cache/conftool/dbconfig/20250506-201744-ladsgroup.json [20:17:48] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [20:18:03] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2008.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [20:18:05] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [20:18:59] FIRING: [22x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:19:18] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, and 2 others: Requesting access to for - https://phabricator.wikimedia.org/T393066#10797819 (BTullis) I have prepared the puppet change for this in: https://gerrit.wikimedia.org/r/c/operations/puppet/+/11... [20:19:30] (CR) Btullis: [C:+1] cirrussearch: move net-new hosts into prod role [puppet] - https://gerrit.wikimedia.org/r/1142636 (https://phabricator.wikimedia.org/T391118) (owner: Bking) [20:24:25] ops-eqiad, SRE, cloud-services-team, DC-Ops: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10797841 (Jclark-ctr) @Andrew can this be closed out if nothing else is needed?
[20:24:26] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [20:24:29] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [20:25:38] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1111-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:26:55] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:26:59] FIRING: [27x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:27:11] (PS1) Andrew Bogott: Remove oslo_policy section [puppet] - https://gerrit.wikimedia.org/r/1142683 (https://phabricator.wikimedia.org/T330759) [20:27:22] ops-eqiad, SRE, Cassandra, DC-Ops: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035#10797869 (Jclark-ctr) @Eevans is this still needed? or can it be resolved?
[20:27:55] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1112 to cirrussearch1112 [20:28:20] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:28:25] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2014.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [20:28:31] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [20:30:39] FIRING: CirrusSearchThreadPoolRejectionsTooHigh: elastic1099-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [20:31:55] FIRING: [17x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P75823 and previous config saved to /var/cache/conftool/dbconfig/20250506-203251-ladsgroup.json [20:32:55] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:33:52] (CR) Andrew Bogott: [C:+2] Remove oslo_policy section [puppet] - https://gerrit.wikimedia.org/r/1142683
(https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [20:33:53] bking@cumin2002 rename (PID 426087) is awaiting input [20:35:39] RESOLVED: CirrusSearchThreadPoolRejectionsTooHigh: elastic1099-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [20:36:08] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1112 to cirrussearch1112 - bking@cumin2002" [20:36:28] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1112 to cirrussearch1112 - bking@cumin2002" [20:36:29] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:36:29] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1112 on all recursors [20:36:32] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1112 on all recursors [20:36:33] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1112 [20:36:56] (CR) JHathaway: [C:+1] "looks good, couple of minor suggestions" [puppet] - https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: Elukey) [20:37:53] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1112 [20:38:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1112 to cirrussearch1112 [20:39:30] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2014.codfw.wmnet w/ force delete
existing files, repooling source-only afterwards [20:39:33] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [20:39:44] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1016.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [20:39:56] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2015.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [20:40:09] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1017.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [20:40:44] !log andrew@cumin1002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cloudrabbit2001-dev.codfw.wmnet: Renew puppet certificate - andrew@cumin1002 [20:41:55] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:41:59] FIRING: [16x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:42:44] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1113 to cirrussearch1113 [20:43:08] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:43:24] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2003-dev.codfw.wmnet 
with OS bookworm [20:43:59] FIRING: [14x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:44:33] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2002-dev.codfw.wmnet with OS bookworm [20:45:16] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2001-dev.codfw.wmnet with OS bookworm [20:47:21] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:47:51] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:48:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P75824 and previous config saved to /var/cache/conftool/dbconfig/20250506-204758-ladsgroup.json [20:48:17] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:48:17] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:48:46] bking@cumin2002 rename (PID 440527) is awaiting input [20:50:35] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:51:15] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:51:17] RECOVERY - 
grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 8.807 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:51:17] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 6.077 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:51:25] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:51:41] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:51:55] FIRING: [13x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:52:21] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs2021.codfw.wmnet -> wdqs2015.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [20:52:24] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [20:52:54] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1017.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [20:53:57] (PS1) Andrew Bogott: wikimediacloud.org: move
codfw1dev rabbitmq cnames [dns] - https://gerrit.wikimedia.org/r/1142684 (https://phabricator.wikimedia.org/T392539) [20:53:59] FIRING: [18x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:54:17] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:54:21] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:54:37] (CR) CI reject: [V:-1] wikimediacloud.org: move codfw1dev rabbitmq cnames [dns] - https://gerrit.wikimedia.org/r/1142684 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott) [20:54:40] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10797966 (phaultfinder) [20:54:51] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:55:19] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:56:00] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1112.eqiad.wmnet with OS bullseye [20:56:17] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:56:35] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:56:44] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1113 to cirrussearch1113 - bking@cumin2002" [20:56:49] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:57:34] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1113 to cirrussearch1113 - bking@cumin2002" [20:57:35] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:57:36] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1113 on all recursors [20:57:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1113 on all recursors [20:57:39] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1113 [20:58:19] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 8.415 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:58:31] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:58:59] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1113 [20:59:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1113 to cirrussearch1113 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250506T2100) [21:00:42] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1113.eqiad.wmnet with OS bullseye [21:01:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:01:39] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2003-dev.codfw.wmnet with reason: host reimage [21:01:52] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:02:14] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 565 bytes in 5.513 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:02:23] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=wdqs1011.eqiad.wmnet|wdqs1016.eqiad.wmnet|wdqs1017.eqiad.wmnet|wdqs2008.codfw.wmnet|wdqs2014.codfw.wmnet|wdqs2015.codfw.wmnet [21:02:42] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:03:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T382778)', diff saved to https://phabricator.wikimedia.org/P75825 and previous config saved to /var/cache/conftool/dbconfig/20250506-210307-ladsgroup.json [21:03:10] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [21:03:14] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1114 to cirrussearch1114 [21:03:23] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2146.codfw.wmnet with reason: Maintenance [21:03:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T382778)', diff saved to https://phabricator.wikimedia.org/P75826 and previous config saved to /var/cache/conftool/dbconfig/20250506-210329-ladsgroup.json [21:03:37] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:03:43] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2001-dev.codfw.wmnet with reason: host reimage [21:03:48] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2002-dev.codfw.wmnet with reason: host reimage [21:05:14] (PS1) Andrew Bogott: codfw1dev rabbit config: remove a comment that is no longer true [puppet] - https://gerrit.wikimedia.org/r/1142687 (https://phabricator.wikimedia.org/T392539) [21:05:15] (PS1) Andrew Bogott: codfw1dev: remove rabbitmq from cloudcontrol nodes [puppet] - https://gerrit.wikimedia.org/r/1142688 (https://phabricator.wikimedia.org/T392539) [21:05:24] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2003-dev.codfw.wmnet with reason: host reimage [21:06:41] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe1003.wikimedia.org with OS bookworm [21:06:45] (CR) Ryan Kemper: [C:+2] wdqs: route query.wd.org to wdqs-main [puppet]
- https://gerrit.wikimedia.org/r/1139531 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper) [21:06:49] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10798031 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm executed with errors: - apus-... [21:06:55] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:06:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T382778)', diff saved to https://phabricator.wikimedia.org/P75827 and previous config saved to /var/cache/conftool/dbconfig/20250506-210658-ladsgroup.json [21:07:13] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1142689 [21:07:27] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1112.eqiad.wmnet with reason: host reimage [21:08:27] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2001-dev.codfw.wmnet with reason: host reimage [21:08:59] FIRING: [13x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:17] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1114 to cirrussearch1114 - bking@cumin2002" [21:11:37] (CR) Dbrant: [C:+2] mobileapps: pipeline bot promote [deployment-charts] -
https://gerrit.wikimedia.org/r/1142689 (owner: PipelineBot) [21:12:07] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1113.eqiad.wmnet with reason: host reimage [21:12:08] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1112.eqiad.wmnet with reason: host reimage [21:12:09] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1114 to cirrussearch1114 - bking@cumin2002" [21:12:09] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:12:09] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1114 on all recursors [21:12:13] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1114 on all recursors [21:12:14] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1114 [21:13:59] FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:31] (Merged) jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1142689 (owner: PipelineBot) [21:15:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:15:17] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1114 [21:15:56] !log
bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1113.eqiad.wmnet with reason: host reimage [21:15:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1114 to cirrussearch1114 [21:16:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:16:32] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [21:16:53] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [21:16:57] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1115 to cirrussearch1115 [21:17:19] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [21:17:21] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:17:51] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [21:18:09] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [21:18:45] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [21:20:09] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2002-dev.codfw.wmnet with reason: host reimage [21:20:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:20:25] bking@cumin2002 reimage (PID 476786) is awaiting input [21:22:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2146', diff saved to https://phabricator.wikimedia.org/P75828 and previous config saved to /var/cache/conftool/dbconfig/20250506-212204-ladsgroup.json [21:23:28] !log T388134 Cutover of query.wikidata.org to `wdqs-main` instead of `wdqs` is ongoing. We're seeing the expected drop in queries to the main cluster (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1746565806937&to=1746566592047) but not seeing corresponding increase in wdqs-main yet [21:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:32] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [21:23:32] (PS1) MusikAnimal: CodeMirror: temporarily disable linting for wikitext [extensions/CodeMirror] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1142692 (https://phabricator.wikimedia.org/T381577) [21:24:05] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1115 to cirrussearch1115 - bking@cumin2002" [21:24:54] !incidents [21:24:54] 6095 (ACKED) Host db1246 (paged) - PING - Packet loss = 100% [21:24:54] 6094 (RESOLVED) [2x] ProbeDown sre (aux-k8s-ctrl2003:6443 probes/custom codfw) [21:25:03] !resolve 6095 [21:25:03] 6095 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [21:25:45] ^ resolving 6095 to avoid re-notification 24h later [21:26:25] (PS1) Ebernhardson: WIP: services_proxy: Support multiple ports on discovery dns services [puppet] - https://gerrit.wikimedia.org/r/1142693 [21:26:41] (CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1142693 (owner: Ebernhardson) [21:26:55] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:27:06] (03PS5) 10Máté Szabó: Unify IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) [21:27:11] bking@cumin2002 rename (PID 476422) is awaiting input [21:27:12] (03CR) 10Máté Szabó: Unify IPInfo access levels (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [21:27:27] (03CR) 10CI reject: [V:04-1] Unify IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [21:28:16] !log T388134 Seeing 502 errors; that explains why the drop in requests to wdqs-full is not matched by an increase to wdqs-main. Rolling back for now while we figure out what piece we're missing [21:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:34] (03PS1) 10Ryan Kemper: Revert "wdqs: route query.wd.org to wdqs-main" [puppet] - 10https://gerrit.wikimedia.org/r/1142694 [21:28:43] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] Revert "wdqs: route query.wd.org to wdqs-main" [puppet] - 10https://gerrit.wikimedia.org/r/1142694 (owner: 10Ryan Kemper) [21:29:13] (03CR) 10CI reject: [V:04-1] WIP: services_proxy: Support multiple ports on discovery dns services [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson) [21:30:25] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:32:56] (03PS2) 10Ebernhardson: WIP: services_proxy: Support multiple ports on discovery dns services [puppet] - 
10https://gerrit.wikimedia.org/r/1142693 [21:33:06] FIRING: SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [21:35:07] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson) [21:35:55] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1112.eqiad.wmnet with OS bullseye [21:36:45] (03PS1) 10Andrew Bogott: rabbitmq: add hiera role config for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1142697 (https://phabricator.wikimedia.org/T392539) [21:37:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P75829 and previous config saved to /var/cache/conftool/dbconfig/20250506-213712-ladsgroup.json [21:38:06] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [21:38:33] (03PS2) 10Bvibber: Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142671 (https://phabricator.wikimedia.org/T393286) (owner: 10Jdlrobson) [21:38:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1113.eqiad.wmnet with OS bullseye [21:38:56] (03PS1) 10Bvibber: Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1142698 (https://phabricator.wikimedia.org/T393286) [21:39:15] (03CR) 10Andrew Bogott: [C:03+2] rabbitmq: add hiera role config for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1142697 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [21:40:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 07 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/Chart] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142671 (https://phabricator.wikimedia.org/T393286) (owner: 10Jdlrobson) [21:40:25] RESOLVED: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:40:28] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudrabbit2002-dev.codfw.wmnet with OS bookworm [21:40:29] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudrabbit2003-dev.codfw.wmnet with OS bookworm [21:40:31] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudrabbit2001-dev.codfw.wmnet with OS bookworm [21:40:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/Chart] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1142698 (https://phabricator.wikimedia.org/T393286) (owner: 10Bvibber) [21:41:07] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2001-dev.codfw.wmnet with OS bookworm [21:41:16] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2002-dev.codfw.wmnet with OS bookworm [21:41:20] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2003-dev.codfw.wmnet with OS bookworm [21:43:06] RESOLVED: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [21:43:12] (03PS1) 10MusikAnimal: InitialiseSettings: enable multiblocks on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142699 
(https://phabricator.wikimedia.org/T377121) [21:43:29] (03PS3) 10Ebernhardson: WIP: services_proxy: Support multiple ports on discovery dns services [puppet] - 10https://gerrit.wikimedia.org/r/1142693 [21:44:44] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5469/co" [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson) [21:45:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1111-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:51:44] (03CR) 10Jdlrobson: [C:03+1] Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1142698 (https://phabricator.wikimedia.org/T393286) (owner: 10Bvibber) [21:52:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T382778)', diff saved to https://phabricator.wikimedia.org/P75830 and previous config saved to /var/cache/conftool/dbconfig/20250506-215219-ladsgroup.json [21:52:22] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [21:52:36] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2153.codfw.wmnet with reason: Maintenance [21:52:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T382778)', diff saved to https://phabricator.wikimedia.org/P75831 and previous config saved to /var/cache/conftool/dbconfig/20250506-215242-ladsgroup.json [21:53:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:54:56] (03PS4) 10Ebernhardson: services_proxy: Support multiple ports on discovery dns services [puppet] - 10https://gerrit.wikimedia.org/r/1142693 [21:55:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T382778)', diff saved to https://phabricator.wikimedia.org/P75832 and previous config saved to /var/cache/conftool/dbconfig/20250506-215549-ladsgroup.json [21:56:20] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5470/co" [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson) [21:58:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:59:23] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2003-dev.codfw.wmnet with reason: host reimage [21:59:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:59:56] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2001-dev.codfw.wmnet with reason: host reimage [22:00:00] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2002-dev.codfw.wmnet with reason: host reimage [22:01:11] !log bking@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1115 to cirrussearch1115 - bking@cumin2002" [22:01:11] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:01:11] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1115 on all recursors [22:01:15] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1115 on all recursors [22:01:16] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1115 [22:02:07] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2003-dev.codfw.wmnet with reason: host reimage [22:02:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1113.eqiad.wmnet with OS bullseye [22:02:52] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1115 [22:03:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1115 to cirrussearch1115 [22:03:41] (03PS1) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) [22:03:46] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:04:26] (03CR) 10CI reject: [V:04-1] Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [22:04:28] (03CR) 10Ebernhardson: [V:03+1] "Based on PCC it seems reasonably likely this would fix our problem where the newly defined (Ie6dfb586f6) discovery services on 630[234] al" [puppet] 
- 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson) [22:04:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [22:05:56] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2001-dev.codfw.wmnet with reason: host reimage [22:08:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:10:19] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2002-dev.codfw.wmnet with reason: host reimage [22:10:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P75833 and previous config saved to /var/cache/conftool/dbconfig/20250506-221056-ladsgroup.json [22:13:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:13:45] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1113.eqiad.wmnet with reason: host reimage [22:14:03] (03Abandoned) 10MusikAnimal: CodeMirror: temporarily disable linting for wikitext [extensions/CodeMirror] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142692 (https://phabricator.wikimedia.org/T381577) (owner: 10MusikAnimal) [22:17:25] !log bking@cumin2002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1113.eqiad.wmnet with reason: host reimage [22:21:12] (03CR) 10Novem Linguae: Charts phase 1 deployment (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [22:21:38] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2003-dev.codfw.wmnet with OS bookworm [22:25:33] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2001-dev.codfw.wmnet with OS bookworm [22:26:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P75834 and previous config saved to /var/cache/conftool/dbconfig/20250506-222603-ladsgroup.json [22:29:14] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2002-dev.codfw.wmnet with OS bookworm [22:32:57] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1114.eqiad.wmnet with OS bullseye [22:34:01] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1115.eqiad.wmnet with OS bullseye [22:34:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1113.eqiad.wmnet with OS bullseye [22:41:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T382778)', diff saved to https://phabricator.wikimedia.org/P75835 and previous config saved to /var/cache/conftool/dbconfig/20250506-224110-ladsgroup.json [22:41:13] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [22:41:26] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2170.codfw.wmnet with reason: Maintenance [22:41:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T382778)', diff saved to 
https://phabricator.wikimedia.org/P75836 and previous config saved to /var/cache/conftool/dbconfig/20250506-224132-ladsgroup.json [22:44:19] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1114.eqiad.wmnet with reason: host reimage [22:44:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T382778)', diff saved to https://phabricator.wikimedia.org/P75837 and previous config saved to /var/cache/conftool/dbconfig/20250506-224440-ladsgroup.json [22:45:11] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1115.eqiad.wmnet with reason: host reimage [22:48:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1114.eqiad.wmnet with reason: host reimage [22:51:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1115.eqiad.wmnet with reason: host reimage [22:57:37] (03PS2) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) [22:58:24] (03CR) 10CI reject: [V:04-1] Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [22:59:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P75838 and previous config saved to /var/cache/conftool/dbconfig/20250506-225947-ladsgroup.json [23:09:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hmonroy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142699 (https://phabricator.wikimedia.org/T377121) (owner: 10MusikAnimal) [23:10:38] (03Merged) 10jenkins-bot: InitialiseSettings: enable multiblocks on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142699 (https://phabricator.wikimedia.org/T377121) (owner: 
10MusikAnimal) [23:14:39] FIRING: CirrusSearchThreadPoolRejectionsTooHigh: elastic1057-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [23:14:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P75839 and previous config saved to /var/cache/conftool/dbconfig/20250506-231454-ladsgroup.json [23:15:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1114.eqiad.wmnet with OS bullseye [23:15:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1112-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:16:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [23:17:20] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [23:17:31] FIRING: [2x] 
CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [23:17:31] FIRING: [3x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1075:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:18:01] Hola! I'm in the middle of deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1142699 and I'm getting: [23:18:08] https://www.irccloud.com/pastebin/HDgNfHH0/ [23:18:30] first time getting this message. Should I proceed? [23:18:46] just one commit then [23:19:39] FIRING: [2x] CirrusSearchThreadPoolRejectionsTooHigh: elastic1057-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [23:19:40] yup [23:19:45] I'm reviewing it [23:19:59] ty! 
[23:20:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1115.eqiad.wmnet with OS bullseye [23:21:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [23:22:20] ok, deploy it and I'll test it [23:22:20] RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [23:22:26] RESOLVED: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [23:22:31] FIRING: [8x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:22:39] k [23:22:49] !log hmonroy@deploy1003 Started scap sync-world: Backport for [[gerrit:1142699|InitialiseSettings: enable multiblocks on group0 (T377121)]] [23:22:52] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [23:24:19] I got my staff rights so I can test, too [23:24:34] I'm planning on testing the ExtensionDistributor 
change [23:24:39] RESOLVED: [2x] CirrusSearchThreadPoolRejectionsTooHigh: elastic1057-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [23:24:40] ohh I see [23:27:08] (03PS1) 10Aleksandar Mastilovic: Removing WM Enterprise downloader Puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) [23:27:28] RESOLVED: [8x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:29:30] !log hmonroy@deploy1003 musikanimal, hmonroy: Backport for [[gerrit:1142699|InitialiseSettings: enable multiblocks on group0 (T377121)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:29:33] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [23:30:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T382778)', diff saved to https://phabricator.wikimedia.org/P75840 and previous config saved to /var/cache/conftool/dbconfig/20250506-233002-ladsgroup.json [23:30:04] ExtensionDistributor looks good on test servers [23:30:05] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [23:30:06] looks good 👍 [23:30:18] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2173.codfw.wmnet with reason: Maintenance [23:30:20] nice! 
ok proceeding [23:30:24] !log hmonroy@deploy1003 musikanimal, hmonroy: Continuing with sync [23:30:34] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:30:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T382778)', diff saved to https://phabricator.wikimedia.org/P75841 and previous config saved to /var/cache/conftool/dbconfig/20250506-233041-ladsgroup.json [23:31:16] (03CR) 10CI reject: [V:04-1] Removing WM Enterprise downloader Puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [23:31:48] (03PS1) 10MusikAnimal: Revert "JavaScript: ESLint 8.57.0" [extensions/CodeMirror] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142714 (https://phabricator.wikimedia.org/T381577) [23:32:53] while we're here… could we possibly get https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CodeMirror/+/1142714 backported? 
this reverts some unfinished work from the Hackathon [23:33:23] (03PS2) 10Aleksandar Mastilovic: Removing WM Enterprise downloader Puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) [23:33:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T382778)', diff saved to https://phabricator.wikimedia.org/P75842 and previous config saved to /var/cache/conftool/dbconfig/20250506-233339-ladsgroup.json [23:35:33] (03PS1) 10Zabe: SkinTemplate: Restore a string 'class' in tabAction() [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142715 (https://phabricator.wikimedia.org/T393504) [23:37:07] !log hmonroy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142699|InitialiseSettings: enable multiblocks on group0 (T377121)]] (duration: 14m 17s) [23:37:10] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [23:37:28] k, done [23:37:49] do you want to do the next one @musikanimal ? 
[23:38:11] yes please :) [23:39:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1142716 [23:39:19] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1142716 (owner: 10TrainBranchBot) [23:39:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hmonroy@deploy1003 using scap backport" [extensions/CodeMirror] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142714 (https://phabricator.wikimedia.org/T381577) (owner: 10MusikAnimal) [23:41:41] (03CR) 10BryanDavis: [C:03+1] python3: add python3-venv to devel image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1138442 (owner: 10Hashar) [23:41:45] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 214953160 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:42:45] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 124608 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:48:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P75843 and previous config saved to /var/cache/conftool/dbconfig/20250506-234846-ladsgroup.json [23:51:52] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1142716 (owner: 10TrainBranchBot) [23:51:53] (03Merged) 10jenkins-bot: Revert "JavaScript: ESLint 8.57.0" [extensions/CodeMirror] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142714 (https://phabricator.wikimedia.org/T381577) (owner: 10MusikAnimal) [23:52:33] !log hmonroy@deploy1003 Started scap sync-world: Backport for [[gerrit:1142714|Revert "JavaScript: ESLint 8.57.0" 
(T381577)]] [23:52:36] T381577: Highlighting of syntax errors, warnings, infos for Wikitext editor - https://phabricator.wikimedia.org/T381577