[00:01:08] (Abandoned) Dzahn: Revert "admin: add fnavas-foundation to analytics-privatedata-users" [puppet] - https://gerrit.wikimedia.org/r/910017 (owner: Dzahn) [00:01:29] (CR) Dzahn: "kind of reverting in https://gerrit.wikimedia.org/r/c/operations/puppet/+/910104" [puppet] - https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) (owner: Ssingh) [00:02:08] !log LDAP - adding uid fnavas-foundation to group wmf - T331482 [00:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:14] T331482: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 [00:02:59] SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (Dzahn) [00:03:29] SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (Dzahn) [00:04:51] SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (Dzahn) @FNavas-foundation Can you please try now and tell me if it works...
[00:10:10] RECOVERY - IPMI Sensor Status on ganeti2019 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [00:11:00] (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [00:11:03] SRE, ops-codfw: Broken PSU on ganeti2019 - https://phabricator.wikimedia.org/T335026 (Jhancock.wm) Open→Resolved replaced the power cord with the new ones. alert has cleared in idrac. The input power for power supply 1 has been restored. Thu Apr 20 2023 00:06:44 The power supplies are redunda... [00:13:54] SRE, LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (Dzahn) @Clement_Goubert I don't want to paste it for the user, but see the output of: ` [mwmaint1002:~] $ ldapsearch -x uid=andrewtavis-wmde | grep mail ` Note there is uid=andr... [00:16:04] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Dwisehaupt) @Cmjohnson I did some debugging with @papaul online and we think there may be a crossed cable issue between frbast1002 and frpig1002. Coul...
[00:32:24] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:37:00] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:39:22] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/910080 [00:39:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/910080 (owner: TrainBranchBot) [00:54:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/910080 (owner: TrainBranchBot) [00:56:42] SRE, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Jdforrester-WMF) [01:01:12] (PS1) MusikAnimal: interwiki: update URL to XTools [mediawiki-config] - https://gerrit.wikimedia.org/r/910110 [01:07:32] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder) [01:09:31] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder) [01:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - 
https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [01:22:07] SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (Dzahn) a:Dzahn [01:24:39] (CR) Dzahn: "also checked that other contractors are in the wmf group." [puppet] - https://gerrit.wikimedia.org/r/910104 (https://phabricator.wikimedia.org/T331482) (owner: Dzahn) [01:27:32] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder) [01:30:30] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (phaultfinder) [01:47:32] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder) [01:48:15] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [01:48:30] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder) [02:01:47] (PS3) Andrea Denisse: prometheus: Added support for syncing data between instances [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [02:04:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:16:01] (PS4) Andrea Denisse: prometheus: Added support for syncing data between instances [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [02:19:13] (PS5) Andrea Denisse: prometheus: Added support for syncing data between instances [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [02:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:20:59] (CR) Andrea Denisse: prometheus: Added support for syncing data between instances (3 comments) [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:44:22] (CR) Dzahn: [C: +1] prometheus: Added support for syncing data between instances (1 comment) [puppet] - 
https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [04:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:49:02] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:50:32] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:06:58] (CR) Marostegui: "Thanks a lot for this. Sorry for the typo. I did monitor the change, but I guess puppet didn't run everywhere before I declared green ligh" [puppet] - https://gerrit.wikimedia.org/r/910076 (https://phabricator.wikimedia.org/T334455) (owner: Cwhite) [05:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [05:12:17] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder) [05:14:16] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder) [05:32:17] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder) [05:35:17] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (phaultfinder) [05:48:15] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - 
https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [05:49:50] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:52:17] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder) [05:53:17] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder) [05:56:18] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:56:56] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The following units failed: user@115.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:57:31] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (MoritzMuehlenhoff) Great work Traffic team (and you're the first SRE sub team to have completed their migration off Buster)! [05:59:30] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/910104 (https://phabricator.wikimedia.org/T331482) (owner: Dzahn) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T0600) [06:00:05] kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T0600). 
[06:04:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:08:28] SRE, ops-codfw: Broken PSU on ganeti2019 - https://phabricator.wikimedia.org/T335026 (MoritzMuehlenhoff) Thanks! [06:09:28] RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:39] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on krb2002.codfw.wmnet with reason: Non-functional, WIP for Bullseye update [06:09:50] PROBLEM - Confd vcl based reload on cp2035 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:09:52] PROBLEM - Confd vcl based reload on cp4039 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:09:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on krb2002.codfw.wmnet with reason: Non-functional, WIP for Bullseye update [06:09:58] PROBLEM - Confd vcl based reload on cp6015 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:01] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the KDCs to Bullseye - https://phabricator.wikimedia.org/T331695 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=22bf6bdd-4c99-40f0-ab28-bb73d3bcbf21) set by jmm@cumin2002 for 6 days, 0:00:00 on 1 host(s) and their services... [06:10:02] PROBLEM - Confd vcl based reload on cp2027 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:02] PROBLEM - Confd vcl based reload on cp1079 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [06:10:03] uhm this is me I think [06:10:04] PROBLEM - Confd vcl based reload on cp2037 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:05] confd [06:10:06] PROBLEM - Confd vcl based reload on cp3064 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:06] PROBLEM - Confd vcl based reload on cp3060 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:07] PROBLEM - Confd vcl based reload on cp3050 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:07] PROBLEM - Confd vcl based reload on cp6010 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:08] PROBLEM - Confd vcl based reload on cp1075 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:12] PROBLEM - Confd vcl based reload on cp5018 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:12] PROBLEM - Confd vcl based reload on cp5021 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:16] PROBLEM - Confd vcl based reload on cp1087 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:16] PROBLEM - Confd vcl based reload on cp4038 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:22] PROBLEM - Confd vcl based reload on cp5020 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:22] PROBLEM - Confd vcl based reload on cp5019 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [06:10:22] PROBLEM - Confd vcl based reload on cp2039 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:24] PROBLEM - Confd vcl based reload on cp4044 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:24] PROBLEM - Confd vcl based reload on cp3056 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:26] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:26] PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:26] PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:26] PROBLEM - Confd vcl based reload on cp1089 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:26] PROBLEM - Confd vcl based reload on cp4040 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:26] PROBLEM - Confd vcl based reload on cp5017 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:32] PROBLEM - Confd vcl based reload on cp5022 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:32] PROBLEM - Confd vcl based reload on cp1077 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:42] PROBLEM - Confd vcl based reload on cp3054 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [06:10:42] PROBLEM - Confd vcl based reload on cp4042 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:44] PROBLEM - Confd vcl based reload on cp4041 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [06:10:50] sigh the stupidest error [06:11:24] RECOVERY - Confd vcl based reload on cp2035 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:28] RECOVERY - Confd vcl based reload on cp4039 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:34] RECOVERY - Confd vcl based reload on cp6015 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:36] RECOVERY - Confd vcl based reload on cp2027 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:38] RECOVERY - Confd vcl based reload on cp1079 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:40] RECOVERY - Confd vcl based reload on cp2037 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:42] RECOVERY - Confd vcl based reload on cp3064 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:42] RECOVERY - Confd vcl based reload on cp3060 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:42] RECOVERY - Confd vcl based reload on cp3050 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:42] RECOVERY - Confd vcl based reload on cp6010 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [06:11:44] RECOVERY - Confd vcl based reload on cp1075 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:44] at least it's easy and fast to fix I think [06:11:46] RECOVERY - Confd vcl based reload on cp5021 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:48] RECOVERY - Confd vcl based reload on cp5018 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:50] RECOVERY - Confd vcl based reload on cp1087 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:52] RECOVERY - Confd vcl based reload on cp4038 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:56] RECOVERY - Confd vcl based reload on cp2039 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:58] RECOVERY - Confd vcl based reload on cp5020 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:58] RECOVERY - Confd vcl based reload on cp5019 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:11:58] RECOVERY - Confd vcl based reload on cp4044 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:00] RECOVERY - Confd vcl based reload on cp1089 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:00] RECOVERY - Confd vcl based reload on cp3056 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:00] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 1 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [06:12:00] RECOVERY - Confd vcl based reload on cp6011 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:00] RECOVERY - Confd vcl based reload on cp6016 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:00] RECOVERY - Confd vcl based reload on cp4040 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:02] RECOVERY - Confd vcl based reload on cp5017 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:06] RECOVERY - Confd vcl based reload on cp5022 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:08] RECOVERY - Confd vcl based reload on cp1077 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:16] RECOVERY - Confd vcl based reload on cp4042 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:16] RECOVERY - Confd vcl based reload on cp3054 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [06:12:20] RECOVERY - Confd vcl based reload on cp4041 is OK: reload-vcl successfully ran 0h, 1 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [06:14:42] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:15:05] !log enabled requestctl rule for T332061 [06:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:33] !log installing tomcat9 security updates [06:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:24:54] moritzm: I won't dare asking where do we have tomcat [06:24:59] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 14593 [06:25:17] I have suspicions, in that case I feel your pain :P [06:25:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 14593 [06:29:06] SRE, Infrastructure-Foundations: Determine which sender address to use for email notification - https://phabricator.wikimedia.org/T335091 (SLyngshede-WMF) [06:29:15] the IDPs and puppetdb essentially, CAS does the full stack with WAR deploys and all the joy. 
puppetdb only uses some classes internally [06:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:30:23] (PS2) Stevemunene: Configure product analytics airflow instance [puppet] - https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) [06:30:47] (CR) CI reject: [V: -1] Configure product analytics airflow instance [puppet] - https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) (owner: Stevemunene) [06:32:56] (PS3) Stevemunene: Configure product analytics airflow instance [puppet] - https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) [06:37:20] (CR) Ayounsi: [C: +2] mgmt: allow prometheus [homer/public] - https://gerrit.wikimedia.org/r/909980 (https://phabricator.wikimedia.org/T335027) (owner: Ayounsi) [06:55:01] (CR) Muehlenhoff: [C: +1] thanos-fe: proper insetup Puppet roles to machine [puppet] - https://gerrit.wikimedia.org/r/906023 (owner: Alexandros Kosiaris) [06:57:29] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: Eevans) [06:57:44] (PS1) Marostegui: install_server: Do not reimage db1215 [puppet] - https://gerrit.wikimedia.org/r/910413 (https://phabricator.wikimedia.org/T326669) [06:58:17] (CR) Marostegui: [C: +1] auto_schema: Get rid of concept of skipping replicas [software] - https://gerrit.wikimedia.org/r/910057 (owner: Ladsgroup) [06:58:39] (CR) Marostegui: [C: +1] "Can you make sure to update the doc in case it needs changing?" 
[software] - https://gerrit.wikimedia.org/r/910057 (owner: Ladsgroup) [06:58:42] (CR) Marostegui: [C: +2] install_server: Do not reimage db1215 [puppet] - https://gerrit.wikimedia.org/r/910413 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui) [06:59:30] (CR) Marostegui: [C: +1] auto_schema: Add support for more straightforward check functions [software] - https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: Ladsgroup) [07:00:04] Amir1, apergos, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T0700). [07:00:10] morning! once again there are no trainees signed up for this slot, and once again that's just as well, since there are no patches scheduled for deployment. have a nice Earth Day tomorrow for folks in the US, and a quiet rest of the week in any case! [07:01:26] SRE, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Marostegui) [07:01:59] SRE, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Marostegui) @jcrespo kindly check what is needed for backup involved hosts, thanks! [07:24:39] !log uploaded imagemagick 8:6.9.10.23+dfsg-2.1+deb10u1+wmf1 to apt.wikimedia.org for buster-wikimedia T328901 [07:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:32] SRE, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (jcrespo) [07:30:39] SRE, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (jcrespo) >>! 
In T335042#8795210, @Marostegui wrote: > @jcrespo kindly check what is needed for backup involved hosts, thanks! Done. [07:32:54] (PS1) Marostegui: change_mw_mysql_pass.sh: Change zarcillo host [software] - https://gerrit.wikimedia.org/r/910414 (https://phabricator.wikimedia.org/T334455) [07:33:38] (CR) Marostegui: [C: +2] change_mw_mysql_pass.sh: Change zarcillo host [software] - https://gerrit.wikimedia.org/r/910414 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui) [07:34:07] (Merged) jenkins-bot: change_mw_mysql_pass.sh: Change zarcillo host [software] - https://gerrit.wikimedia.org/r/910414 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui) [07:35:37] (CR) Jcrespo: "All in all, I am happy with the result: the issue gave a clear error on icinga, and no metric was lost because the script refused to conti" [puppet] - https://gerrit.wikimedia.org/r/910076 (https://phabricator.wikimedia.org/T334455) (owner: Cwhite) [07:36:16] (CR) Marostegui: prometheus: Change zarcillo location (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910076 (https://phabricator.wikimedia.org/T334455) (owner: Cwhite) [07:41:12] (PS1) Elukey: profile::puppet_compiler::clean_reports: increase cleanup frequency [puppet] - https://gerrit.wikimedia.org/r/910415 [07:41:42] (CR) Elukey: "John lemme know if this is ok or a bad idea, afaics it is only used on the pcc hosts so it should be safe to test :)" [puppet] - https://gerrit.wikimedia.org/r/910415 (owner: Elukey) [07:49:49] (CR) Muehlenhoff: [C: +1] cassandra: add de-init to systemd unit file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: Eevans) [07:50:53] (CR) Muehlenhoff: [C: +1] "Looks good" [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/909403 (https://phabricator.wikimedia.org/T334754) (owner: Eevans) [07:59:49] (CR) Elukey: "Tested the rules 
locally on dse-k8s-worker1001:" [puppet] - https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) (owner: Elukey) [08:00:05] jnuche and ^demon: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T0800). [08:00:30] morning, I'll deploy the train to group2 in 10m [08:04:17] SRE, LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (Clement_Goubert) Ah thanks @Dzahn I didn't think to use LDAP for this. I'll add it to the documentation. [08:06:43] (PS1) Clément Goubert: admin: Add andrewtavis-wmde [puppet] - https://gerrit.wikimedia.org/r/910417 (https://phabricator.wikimedia.org/T334960) [08:09:36] (PS1) TrainBranchBot: all wikis to 1.41.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/910418 (https://phabricator.wikimedia.org/T330211) [08:09:38] (CR) TrainBranchBot: [C: +2] all wikis to 1.41.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/910418 (https://phabricator.wikimedia.org/T330211) (owner: TrainBranchBot) [08:10:20] (Merged) jenkins-bot: all wikis to 1.41.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/910418 (https://phabricator.wikimedia.org/T330211) (owner: TrainBranchBot) [08:12:38] (CR) Clément Goubert: [C: +2] admin: move fnavas to ldap_only admins, remove from a-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/910104 (https://phabricator.wikimedia.org/T331482) (owner: Dzahn) [08:13:48] (PS1) Elukey: amd-gpu-tester: replace rocblas with rocblas-dev [docker-images/production-images] - https://gerrit.wikimedia.org/r/910419 (https://phabricator.wikimedia.org/T333009) [08:17:10] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.5 refs T330211 [08:17:16] T330211: 1.41.0-wmf.5 deployment blockers - 
https://phabricator.wikimedia.org/T330211 [08:23:05] (CR) Elukey: [C: +2] ml-services: deployment of ores-legacy app in staging [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos) [08:26:51] (CR) Elukey: [V: +2 C: +2] amd-gpu-tester: replace rocblas with rocblas-dev [docker-images/production-images] - https://gerrit.wikimedia.org/r/910419 (https://phabricator.wikimedia.org/T333009) (owner: Elukey) [08:28:48] SRE, LDAP-Access-Requests, Patch-For-Review: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (AndrewTavis_WMDE) Thanks all for the help with this! Let me know if anything else is needed on my end, and the email is in the Meta Wiki for this account [[ ht... [08:31:36] SRE, LDAP-Access-Requests, Patch-For-Review: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (Clement_Goubert) Hey @AndrewTavis_WMDE, thanks for the email confirmation. I'm just waiting for a review on the patch, then your access will be created. I'll u... [08:31:55] (CR) Ayounsi: [C: +1] replace gerrit1001 with gerrit1003 as ping target for blackbox smoke [puppet] - https://gerrit.wikimedia.org/r/909791 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn) [08:31:57] (PS2) Clément Goubert: admin: Add andrewtavis-wmde [puppet] - https://gerrit.wikimedia.org/r/910417 (https://phabricator.wikimedia.org/T334960) [08:32:48] SRE, Infrastructure-Foundations, netops, observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi) Thanks for the quick reply! This now works: ` prometheus1006:~$ curl lsw1-e8-eqiad.mgmt.eqiad.wmnet:9100/metrics | wc -l 3412 ` I guess next step is to s... 
[08:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:35:18] SRE, LDAP-Access-Requests, Patch-For-Review: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (AndrewTavis_WMDE) Wonderful :) Thank you, @Clement_Goubert! [08:42:54] (CR) Muehlenhoff: "Looks good in general, I have three final comments." [software/bitu] - https://gerrit.wikimedia.org/r/899519 (owner: Slyngshede) [08:52:33] (CR) Ladsgroup: change_mw_mysql_pass.sh: Change zarcillo host (1 comment) [software] - https://gerrit.wikimedia.org/r/910414 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui) [08:56:08] (CR) Marostegui: [C: +2] "let's leave it there for now" [software] - https://gerrit.wikimedia.org/r/910414 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui) [08:57:56] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[09:04:02] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.18 in codfw [09:04:06] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Get rid of concept of skipping replicas (031 comment) [software] - 10https://gerrit.wikimedia.org/r/910057 (owner: 10Ladsgroup) [09:04:12] (03CR) 10CI reject: [V: 04-1] auto_schema: Get rid of concept of skipping replicas [software] - 10https://gerrit.wikimedia.org/r/910057 (owner: 10Ladsgroup) [09:06:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.18 in codfw [09:07:01] (03CR) 10Btullis: Configure product analytics airflow instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [09:11:11] (03PS1) 10Barakat Ajadi: Remove enabling of Central Notice Timing in wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910427 (https://phabricator.wikimedia.org/T334550) [09:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [09:12:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [09:12:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/910417 (https://phabricator.wikimedia.org/T334960) (owner: 10Clément Goubert) [09:13:32] (03CR) 10Cathal Mooney: Expose additional link information to Homer templates in wmf-netbox.py (0311 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) (owner: 10Cathal Mooney) [09:14:02] (03CR) 10Clément Goubert: 
[C: 03+2] admin: Add andrewtavis-wmde [puppet] - 10https://gerrit.wikimedia.org/r/910417 (https://phabricator.wikimedia.org/T334960) (owner: 10Clément Goubert) [09:14:32] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [09:14:49] (03CR) 10Barakat Ajadi: "Hi, Please review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910427 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [09:17:33] 10SRE, 10LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10Clement_Goubert) [09:17:40] 10SRE, 10LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10Clement_Goubert) 05In progress→03Resolved All done, your access should be fully operational in the next half-hour. Feel free to reopen if you encounter any issue. [09:18:38] 10SRE, 10LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10AndrewTavis_WMDE) Thank you, @Clement_Goubert and everyone else for the help! [09:20:12] 10SRE, 10Infrastructure-Foundations: Netbox PuppetDB import script deletes cable labels when interfaces are renamed - https://phabricator.wikimedia.org/T334987 (10cmooney) Just noting for when we tackle this, the current process also defaults all cables to black DACs. Which is mostly fine, but probably we sho... 
[09:23:25] (03PS1) 10Ladsgroup: Migrate 2023 schema changes to use get_columns/get_indexes [software/schema-changes] - 10https://gerrit.wikimedia.org/r/910428 (https://phabricator.wikimedia.org/T304654) [09:23:33] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Add support for more straightforward check functions [software] - 10https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: 10Ladsgroup) [09:23:47] (03CR) 10CI reject: [V: 04-1] Migrate 2023 schema changes to use get_columns/get_indexes [software/schema-changes] - 10https://gerrit.wikimedia.org/r/910428 (https://phabricator.wikimedia.org/T304654) (owner: 10Ladsgroup) [09:24:02] (03Merged) 10jenkins-bot: auto_schema: Add support for more straightforward check functions [software] - 10https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: 10Ladsgroup) [09:25:49] (03PS1) 10Elukey: admin_ng: set 'deploy' for ores-legacy in ml-serve's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/910429 (https://phabricator.wikimedia.org/T330414) [09:29:03] (03CR) 10MVernon: [C: 03+1] "Modulo my slight confusion below, this looks good to me, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [09:29:53] (03CR) 10Ilias Sarantopoulos: [C: 03+1] admin_ng: set 'deploy' for ores-legacy in ml-serve's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/910429 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [09:30:17] (03CR) 10MVernon: [C: 03+1] Do not de-init node prior to restart (031 comment) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/909403 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [09:31:31] (03PS2) 10Ladsgroup: Migrate 2023 schema changes to use get_columns/get_indexes [software/schema-changes] - 10https://gerrit.wikimedia.org/r/910428 (https://phabricator.wikimedia.org/T304654) [09:32:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [09:32:42] (03CR) 10Elukey: [C: 03+2] admin_ng: set 'deploy' for ores-legacy in ml-serve's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/910429 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [09:32:50] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Clement_Goubert) [09:33:27] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert) [09:35:31] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [09:35:32] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:35:41] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[09:35:51] (03PS1) 10Giuseppe Lavagetto: Rakefile: bump istioctl version [deployment-charts] - 10https://gerrit.wikimedia.org/r/910431 [09:40:04] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:40:25] (03CR) 10CI reject: [V: 04-1] Rakefile: bump istioctl version [deployment-charts] - 10https://gerrit.wikimedia.org/r/910431 (owner: 10Giuseppe Lavagetto) [09:40:27] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:42:18] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert) [09:42:37] (03CR) 10Marostegui: [C: 03+1] Migrate 2023 schema changes to use get_columns/get_indexes [software/schema-changes] - 10https://gerrit.wikimedia.org/r/910428 (https://phabricator.wikimedia.org/T304654) (owner: 10Ladsgroup) [09:42:50] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:42:54] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:42:57] (03CR) 10Ladsgroup: [C: 03+2] Migrate 2023 schema changes to use get_columns/get_indexes [software/schema-changes] - 10https://gerrit.wikimedia.org/r/910428 (https://phabricator.wikimedia.org/T304654) (owner: 10Ladsgroup) [09:43:21] (03Merged) 10jenkins-bot: Migrate 2023 schema changes to use get_columns/get_indexes [software/schema-changes] - 10https://gerrit.wikimedia.org/r/910428 (https://phabricator.wikimedia.org/T304654) (owner: 10Ladsgroup) [09:43:53] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[09:45:02] (03CR) 10JMeybohm: [C: 03+1] Rakefile: bump istioctl version [deployment-charts] - 10https://gerrit.wikimedia.org/r/910431 (owner: 10Giuseppe Lavagetto) [09:47:02] (03PS1) 10Marostegui: pc2011: Master for pc1 [puppet] - 10https://gerrit.wikimedia.org/r/910435 [09:47:44] (03CR) 10Marostegui: [C: 03+2] pc2011: Master for pc1 [puppet] - 10https://gerrit.wikimedia.org/r/910435 (owner: 10Marostegui) [09:48:15] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [09:49:22] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (10HasanAkgun_WMDE) [09:52:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [09:53:20] (03PS1) 10Muehlenhoff: sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/910438 (https://phabricator.wikimedia.org/T330495) [09:53:33] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [09:55:47] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/910438 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [09:59:56] (03PS2) 10Muehlenhoff: sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/910438 (https://phabricator.wikimedia.org/T330495) [10:00:05] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1000). [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1000) [10:03:43] (03CR) 10Giuseppe Lavagetto: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/910431 (owner: 10Giuseppe Lavagetto) [10:04:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:06:25] (03CR) 10Jbond: [C: 04-1] "-1 is because i think the error may lie else where" [puppet] - 10https://gerrit.wikimedia.org/r/910415 (owner: 10Elukey) [10:12:26] (03CR) 10Elukey: profile::puppet_compiler::clean_reports: increase cleanup frequency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910415 (owner: 10Elukey) [10:16:57] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium a:03Clement_Goubert [10:17:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rakefile: bump istioctl version [deployment-charts] - 10https://gerrit.wikimedia.org/r/910431 (owner: 10Giuseppe Lavagetto) [10:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:22:47] (03CR) 10Volans: "I think there is a small problem, see inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/910438 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [10:24:13] (03Merged) 10jenkins-bot: Rakefile: 
bump istioctl version [deployment-charts] - 10https://gerrit.wikimedia.org/r/910431 (owner: 10Giuseppe Lavagetto) [10:27:28] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (10Clement_Goubert) [10:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:31:22] PSU and "firing" gave me a concern there for a moment. [10:34:19] 10ops-codfw, 10Traffic: Broken PSU on cp2031 - https://phabricator.wikimedia.org/T335110 (10MoritzMuehlenhoff) [10:37:55] (03PS1) 10Elukey: amd-gpu-tester: fix/add more ROCm packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/910440 [10:37:59] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (10Clement_Goubert) @HasanAkgun_WMDE I have sent you an email for out-of-band verification of your ssh key. @karapayneWMDE As approver for WMDE, can you approve this request? @thcipri... 
[10:40:08] (03CR) 10Jbond: [C: 04-1] profile::puppet_compiler::clean_reports: increase cleanup frequency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910415 (owner: 10Elukey) [10:40:15] (03CR) 10Volans: [C: 03+1] "LGTM, couple of optional nits inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [10:40:23] (03PS1) 10Elukey: admin_ng: set deployTLSCertificate for ores-legacy in ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/910442 (https://phabricator.wikimedia.org/T330414) [10:40:41] (03Abandoned) 10Elukey: profile::puppet_compiler::clean_reports: increase cleanup frequency [puppet] - 10https://gerrit.wikimedia.org/r/910415 (owner: 10Elukey) [10:40:45] (03CR) 10Elukey: profile::puppet_compiler::clean_reports: increase cleanup frequency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910415 (owner: 10Elukey) [10:41:27] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.a8 in codfw [10:42:55] (03CR) 10Elukey: [V: 03+2 C: 03+2] amd-gpu-tester: fix/add more ROCm packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/910440 (owner: 10Elukey) [10:43:57] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.remove-ghost-objects (exit_code=99) from container wikipedia-en-local-public.a8 in codfw [10:48:29] (03PS3) 10Muehlenhoff: sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/910438 (https://phabricator.wikimedia.org/T330495) [10:55:00] (03CR) 10Elukey: [C: 03+2] admin_ng: set deployTLSCertificate for ores-legacy in ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/910442 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [10:57:17] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[10:57:39] !log installing openvswitch security updates on bullseye [10:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:44] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:07:44] (03CR) 10Klausman: [C: 03+1] amd_gpu: add udev rules to bypass the 'render' group [puppet] - 10https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [11:07:58] (03CR) 10Klausman: [C: 03+1] role:dse_k8s::worker: set allow_gpu_broader_access [puppet] - 10https://gerrit.wikimedia.org/r/909969 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [11:08:26] (03CR) 10Klausman: [C: 03+1] admin_ng: set deployTLSCertificate for ores-legacy in ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/910442 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [11:10:55] (03CR) 10Giuseppe Lavagetto: "Please take a look at the new scaffolding structure, and also to the resulting charts when you try to use create-new-service.sh 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885281 (https://phabricator.wikimedia.org/T292818) (owner: 10Giuseppe Lavagetto) [11:20:57] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (10Clement_Goubert) [11:25:59] (03PS4) 10EoghanGaffney: [gitlab/failover] Add check for DNS records update [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) [11:26:39] (03CR) 10EoghanGaffney: [gitlab/failover] Add check for DNS records update (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [11:30:22] (03CR) 10Volans: [C: 03+1] "LGTM, left open question on the hardcoding of the path" [cookbooks] - 10https://gerrit.wikimedia.org/r/910438 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [11:31:38] (03CR) 
10Volans: [C: 03+1] "ship it :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [11:39:58] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: cookbooks.sre.hosts.reimage should not fail if the first Puppet run failed and if the user was prompted - https://phabricator.wikimedia.org/T334880 (10Volans) p:05Low→03Medium a:03Volans [11:45:17] (03PS1) 10Volans: sre.hosts.reimage: improve failed first puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/910461 (https://phabricator.wikimedia.org/T334880) [11:46:03] (03CR) 10Muehlenhoff: sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/910438 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [11:56:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:57:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [11:58:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:03] jouncebot: nowandnext [12:02:03] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [12:02:04] In 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1300) [12:02:04] In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1300) [12:02:23] (03PS6) 10Ladsgroup: Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748) (owner: 10Aklapper) [12:02:28] (03CR) 
10Ladsgroup: [C: 03+2] Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748) (owner: 10Aklapper) [12:02:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748) (owner: 10Aklapper) [12:03:28] (03Merged) 10jenkins-bot: Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748) (owner: 10Aklapper) [12:04:11] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:888708|Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki (T124748)]] [12:04:17] T124748: Deprecate Graph namespace on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748 [12:05:30] !log ladsgroup@deploy2002 aklapper and ladsgroup: Backport for [[gerrit:888708|Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki (T124748)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [12:11:59] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:888708|Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki (T124748)]] (duration: 07m 48s) [12:12:05] T124748: Deprecate Graph namespace on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748 [12:12:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:16:03] (03Abandoned) 10Stevemunene: Jupyterhub-conda exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/904617 
(https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [12:17:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:24:16] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40768/console" [puppet] - 10https://gerrit.wikimedia.org/r/909749 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [12:25:32] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: add script to create fs and raid for backup partition [puppet] - 10https://gerrit.wikimedia.org/r/909749 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [12:30:04] (03PS3) 10Samtar: Initial configuration for guwwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907506 (https://phabricator.wikimedia.org/T334394) [12:30:46] (03CR) 10CI reject: [V: 04-1] Initial configuration for guwwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907506 (https://phabricator.wikimedia.org/T334394) (owner: 10Samtar) [12:33:00] (03PS4) 10Samtar: Initial configuration for guwwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907506 (https://phabricator.wikimedia.org/T334394) [12:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:40:03] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (10Ottomata) @Dzahn I believe that @FNavas-foundation needs LDAP wmf AND ssh-less membership in an... 
[12:42:07] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.a8 in codfw [12:44:36] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.remove-ghost-objects (exit_code=99) from container wikipedia-en-local-public.a8 in codfw [12:48:30] (03PS1) 10Phuedx: Remove deprecated all_settings streamconfigs param [deployment-charts] - 10https://gerrit.wikimedia.org/r/910471 (https://phabricator.wikimedia.org/T286344) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1300) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1300). nyaa~ [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:41] no depwoyments today [13:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:17:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [13:19:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [13:20:40] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) @Dzahn @Ottomata OK! good news is I am IN Superset. But now I cannot see the... 
[13:21:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:46] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (10Clement_Goubert) Ok, I'll be reverting the patch to reinstate you in the `analytics-privatedata... [13:30:08] (03CR) 10CI reject: [V: 04-1] cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 (owner: 10Clément Goubert) [13:30:39] (03PS1) 10Clément Goubert: Revert "admin: move fnavas to ldap_only admins, remove from a-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/910489 [13:30:51] (03CR) 10CI reject: [V: 04-1] Revert "admin: move fnavas to ldap_only admins, remove from a-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/910489 (owner: 10Clément Goubert) [13:32:18] (03PS2) 10Clément Goubert: Revert "admin: move fnavas to ldap_only admins, remove from a-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/910489 [13:33:11] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.a8 in codfw [13:35:40] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.remove-ghost-objects (exit_code=99) from container wikipedia-en-local-public.a8 in codfw 
[13:35:47] (03PS2) 10Clément Goubert: cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 [13:37:17] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [13:38:06] (03CR) 10CI reject: [V: 04-1] cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 (owner: 10Clément Goubert) [13:40:15] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [13:43:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [13:45:53] (03PS3) 10Clément Goubert: cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 [13:48:05] (03CR) 10CI reject: [V: 04-1] cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 (owner: 10Clément Goubert) [13:48:15] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:50:20] (03PS4) 10Clément Goubert: cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 [13:51:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (10karapayneWMDE) I approve! [13:51:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (10karapayneWMDE) I approve! 
[13:52:52] (03CR) 10CI reject: [V: 04-1] cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 (owner: 10Clément Goubert) [13:52:59] * claime shakes fist [13:57:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [13:58:16] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [13:58:39] (03PS1) 10Lucas Werkmeister (WMDE): Make $wmgUseGraphWithJsonNamespace depend on $wmgUseJsonConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910479 (https://phabricator.wikimedia.org/T335130) [14:01:33] jouncebot: now [14:01:33] No deployments scheduled for the next 1 hour(s) and 58 minute(s) [14:02:15] if someone can take a look at my config change ^ I could deploy it soon and (hopefully) unbreak collabwiki [14:03:37] (03CR) 10Lucas Werkmeister (WMDE): "Adding some people from I4bd15deb81 since this is related." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910479 (https://phabricator.wikimedia.org/T335130) (owner: 10Lucas Werkmeister (WMDE)) [14:04:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:09:29] (03CR) 10Elukey: [C: 03+2] amd_gpu: add udev rules to bypass the 'render' group [puppet] - 10https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [14:09:46] (03CR) 10Elukey: [C: 03+2] role:dse_k8s::worker: set allow_gpu_broader_access [puppet] - 10https://gerrit.wikimedia.org/r/909969 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [14:10:43] (03CR) 10Kamila Součková: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/909712 (owner: 10Hnowlan) [14:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on 
elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:26:11] (03PS3) 10Ssingh: varnish: bump size of varnish shared memory log to 160M [puppet] - 10https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) [14:26:14] (03Abandoned) 10JMeybohm: k8s: Rename kubernetes_cluster_groups to kubernetes_clusters [puppet] - 10https://gerrit.wikimedia.org/r/909686 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:27:28] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40778/console" [puppet] - 10https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:28:33] (03CR) 10Zabe: [C: 03+1] "sounds reasonable,I think there was a historical dependency which is no longer there. but I am not the biggest Graph/JsonConfig expert." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/910479 (https://phabricator.wikimedia.org/T335130) (owner: 10Lucas Werkmeister (WMDE)) [14:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:33:17] !log depooling parse2010 for PSU failure [14:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:39] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for parse2010.codfw.wmnet - https://phabricator.wikimedia.org/T335138 (10Clement_Goubert) [14:37:48] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for parse2010.codfw.wmnet - https://phabricator.wikimedia.org/T335138 (10Clement_Goubert) ipmi-sel log: ` 20 | Apr-18-2023 | 16:40:36 | PS Redundancy | Power Supply | Redundancy Lost 21 | Apr-18-2023 | 16:41:04 | Status | Powe... [14:39:25] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) thanks @Clement_Goubert - please advise when you do. I get the following err... 
[14:39:39] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on parse2010.codfw.wmnet with reason: PSU failure [14:39:55] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on parse2010.codfw.wmnet with reason: PSU failure [14:40:15] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for parse2010.codfw.wmnet - https://phabricator.wikimedia.org/T335138 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=07c3aae8-959c-4efb-860f-0f459d621701) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services wi... [14:40:48] I’ll try deploying https://gerrit.wikimedia.org/r/910479 and see if it fixes the problem [14:41:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910479 (https://phabricator.wikimedia.org/T335130) (owner: 10Lucas Werkmeister (WMDE)) [14:42:11] (03PS5) 10Clément Goubert: cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 [14:42:28] (03Merged) 10jenkins-bot: Make $wmgUseGraphWithJsonNamespace depend on $wmgUseJsonConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910479 (https://phabricator.wikimedia.org/T335130) (owner: 10Lucas Werkmeister (WMDE)) [14:42:41] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:910479|Make $wmgUseGraphWithJsonNamespace depend on $wmgUseJsonConfig (T335130)]] [14:42:46] T335130: The content model 'Json.JsonConfig' is not registered on this wiki(Collabwiki) - https://phabricator.wikimedia.org/T335130 [14:43:19] (03PS1) 10Bking: wdqs: activate wdqs2022 [puppet] - 10https://gerrit.wikimedia.org/r/910507 (https://phabricator.wikimedia.org/T331300) [14:43:56] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:910479|Make $wmgUseGraphWithJsonNamespace depend on $wmgUseJsonConfig (T335130)]] synced to 
the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [14:44:34] seems to work, `mwscript dumpBackup collabwiki --current --start 9176 --end 9177` no longer crashes on mwdebug2001 [14:44:47] syncing [14:45:26] !log stevemunene@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-airflow1006.eqiad.wmnet [14:45:27] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [14:45:57] o_O the scap output says we only have four canaries right now? [14:45:58] (03CR) 10Stevemunene: [C: 03+2] Add Product analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/909951 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [14:46:04] isn’t it usually like ten or twelve? [14:47:49] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM an-airflow1006.eqiad.wmnet - stevemunene@cumin1001" [14:48:29] (03PS4) 10Ssingh: varnish: bump size of varnish shared memory log to 160M [puppet] - 10https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) [14:48:42] I guess codfw has fewer canaries than eqiad, ok (if I understand operations/puppet correctly) [14:49:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/910489 (owner: 10Clément Goubert) [14:49:13] I think it's 4 instead of 9 [14:49:32] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM an-airflow1006.eqiad.wmnet - stevemunene@cumin1001" [14:49:32] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:32] !log stevemunene@cumin1001 START - Cookbook sre.dns.wipe-cache an-airflow1006.eqiad.wmnet on all recursors [14:49:35] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
an-airflow1006.eqiad.wmnet on all recursors [14:49:44] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40779/console" [puppet] - 10https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:50:22] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:910479|Make $wmgUseGraphWithJsonNamespace depend on $wmgUseJsonConfig (T335130)]] (duration: 07m 40s) [14:50:27] T335130: The content model 'Json.JsonConfig' is not registered on this wiki(Collabwiki) - https://phabricator.wikimedia.org/T335130 [14:50:35] (03CR) 10Clément Goubert: [C: 03+2] Revert "admin: move fnavas to ldap_only admins, remove from a-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/910489 (owner: 10Clément Goubert) [14:51:09] * Lucas_WMDE done [14:51:50] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (10Clement_Goubert) I merged the patch, it should be live everywhere in around half an hour. 
[14:51:51] (03CR) 10BBlack: [C: 03+1] varnish: bump size of varnish shared memory log to 160M [puppet] - 10https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:53:14] (03CR) 10Vgutierrez: [C: 03+1] varnish: bump size of varnish shared memory log to 160M [puppet] - 10https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:53:40] (03PS1) 10Stevemunene: Add dummy keytabs for new an-aiflow1006 [labs/private] - 10https://gerrit.wikimedia.org/r/910508 (https://phabricator.wikimedia.org/T333000) [14:55:22] (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Add dummy keytabs for new an-aiflow1006 [labs/private] - 10https://gerrit.wikimedia.org/r/910508 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [14:56:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jclark-ctr) Verified cables and corrected them @Dwisehaupt [14:56:37] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) 05In progress→03Resolved {F36957587} [14:56:55] !log disable puppet in A:cp and A:eqsin to test CR 910005 [14:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:47] (03CR) 10Ssingh: [V: 03+1 C: 03+2] varnish: bump size of varnish shared memory log to 160M [puppet] - 10https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:59:34] steve_munene: merging your labs/private commit [14:59:50] ack sukhe [15:00:09] thanks [15:00:15] np! 
[15:00:40] (03PS1) 10JMeybohm: Make kubernetes::clusters the central place for k8s config #2 [puppet] - 10https://gerrit.wikimedia.org/r/910509 (https://phabricator.wikimedia.org/T325268) [15:11:23] (03CR) 10Eevans: cassandra: add de-init to systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [15:15:15] (03CR) 10Andrew Bogott: "lgtm although I wish I could remember what I meant by 'puppet requires ldap.' Probably that's no longer true, even if it ever was." [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [15:15:20] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1107.eqiad.wmnet - https://phabricator.wikimedia.org/T334447 (10Jclark-ctr) 05Open→03Resolved [15:15:48] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1116 - https://phabricator.wikimedia.org/T334926 (10Jclark-ctr) 05Open→03Resolved [15:15:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dns2006.wikimedia.org with OS bullseye [15:15:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2006.wikimedia.org with OS bullseye [15:16:11] (03CR) 10JHathaway: replace puppet::config with concat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [15:16:13] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1100.eqiad.wmnet - https://phabricator.wikimedia.org/T329352 (10Jclark-ctr) 05Open→03Resolved [15:16:53] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1120.eqiad.wmnet - https://phabricator.wikimedia.org/T334580 (10Jclark-ctr) [15:16:59] (03PS1) 10Ssingh: varnish: update template and add missing \ [puppet] - 10https://gerrit.wikimedia.org/r/910515 [15:17:02] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission 
db1120.eqiad.wmnet - https://phabricator.wikimedia.org/T334580 (10Jclark-ctr) 05Open→03Resolved [15:17:18] (03CR) 10JHathaway: [C: 03+2] replace puppet::config with concat [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [15:18:05] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40782/console" [puppet] - 10https://gerrit.wikimedia.org/r/910515 (owner: 10Ssingh) [15:18:44] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1109.eqiad.wmnet - https://phabricator.wikimedia.org/T334820 (10Jclark-ctr) 05Open→03Resolved [15:19:07] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1102 - https://phabricator.wikimedia.org/T334927 (10Jclark-ctr) [15:19:10] (03CR) 10Ssingh: [V: 03+1 C: 03+2] varnish: update template and add missing \ [puppet] - 10https://gerrit.wikimedia.org/r/910515 (owner: 10Ssingh) [15:19:56] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1102 - https://phabricator.wikimedia.org/T334927 (10Jclark-ctr) 05Open→03Resolved [15:20:08] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/910517 [15:20:29] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:21:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:33] !log varnish-frontend-restart cp5022 [15:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:06] (03CR) 10Eevans: [V: 03+2 C: 03+2] Do not de-init node prior to restart [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/909403 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [15:26:19] !log re-enable puppet in A:cp and A:eqsin [15:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:23] (03CR) 10Eevans: [C: 03+2] cassandra: add de-init to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) 
[15:27:38] (03CR) 10Eevans: cassandra: add de-init to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [15:27:40] !log run puppet manually in A:cp and A:eqsin to pick up CR 910005 [15:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:25] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) That's all working now. Thank you to everyone involved! [15:28:35] (03CR) 10Gehel: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/910507 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [15:28:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2006.wikimedia.org with reason: host reimage [15:29:38] RECOVERY - IPMI Sensor Status on parse2010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:30:00] (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [15:30:03] (03PS1) 10Cwhite: prometheus::ops: add demo node exporter job for SONiC [puppet] - 10https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) [15:31:10] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (10Dzahn) Thanks all as well from me. 
Sorry FNavas, this should have been easier! All my comments... [15:32:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2006.wikimedia.org with reason: host reimage [15:32:37] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: add FNavas-foundation to wmf LDAP group (was: Grant Access to analytics_privatedata_users for FNavas-foundation) - https://phabricator.wikimedia.org/T331482 (10Dzahn) 05In progress→03Resolved a:05Dzahn→03Clement_Goubert [15:32:57] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 12 days, 0:00:00 on wdqs2022.codfw.wmnet with reason: attempting WDQS stack on bullseye [15:33:13] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12 days, 0:00:00 on wdqs2022.codfw.wmnet with reason: attempting WDQS stack on bullseye [15:33:28] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/910507 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [15:35:20] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for parse2010.codfw.wmnet - https://phabricator.wikimedia.org/T335138 (10Papaul) 05Open→03Resolved @Clement_Goubert fixed [15:36:36] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM an-airflow1006.eqiad.wmnet - stevemunene@cumin1001" [15:36:47] (03PS1) 10Elukey: ml-services: enable mesh for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/910521 [15:37:38] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM an-airflow1006.eqiad.wmnet - stevemunene@cumin1001" [15:37:39] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1006.eqiad.wmnet [15:38:21] !log sudo cumin -b1 -s1200 'A:cp and A:eqsin' 
'varnish-frontend-restart' [15:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:57] (03CR) 10Cwhite: "PCC ok: https://puppet-compiler.wmflabs.org/output/910083/40785/" [puppet] - 10https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) (owner: 10Cwhite) [15:39:01] (03CR) 10Bking: wdqs: activate wdqs2022 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910507 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [15:41:24] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for Product Analytics Airflow - https://phabricator.wikimedia.org/T334836 (10Stevemunene) Make vm with `sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100 --network analytics --cluster eqiad --group B an-airflow1006` End... [15:44:04] RECOVERY - IPMI Sensor Status on cp2031 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:44:22] !log deploying weekly deployment train for analytics refinery. 
[15:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:14] !log ebysans@deploy2002 Started deploy [analytics/refinery@1631dea]: Regular analytics weekly train [analytics/refinery@1631dea] [15:46:22] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:47:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:47:36] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:48:54] !log stevemunene@cumin1001 START - Cookbook sre.ganeti.reimage for host an-airflow1006.eqiad.wmnet with OS buster [15:48:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:48:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2006.wikimedia.org with OS bullseye [15:49:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2006.wikimedia.org with OS bullseye completed: - dns2006 (**PASS**) - Removed from Pup... 
[15:49:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10Papaul) [15:49:46] (03CR) 10Dzahn: [C: 03+2] gerrit: replace Icinga monitoring with Prometheus, ssh port 29418 [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [15:50:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10Papaul) 05Open→03Resolved @BBlack all yours [15:50:22] (03CR) 10Herron: prometheus: Added support for syncing data between instances (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [15:51:36] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:54:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:54:24] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Papaul) @jcrespo suggested on IRC that we wait after the dc switch over before rebooting the server . ` papaul: not sure if he is still around, but that may be a complex operation, unless it is an emergency maybe better... 
[15:54:45] !log ebysans@deploy2002 Finished deploy [analytics/refinery@1631dea]: Regular analytics weekly train [analytics/refinery@1631dea] (duration: 08m 30s) [15:55:01] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add bast2003 DNS entries - pt1979@cumin2002" [15:55:19] 10SRE, 10ops-codfw, 10Traffic: Broken PSU on cp2031 - https://phabricator.wikimedia.org/T335110 (10Papaul) 05Open→03Resolved a:03Papaul this is complete [15:55:42] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10jcrespo) ^ @Marostegui [15:56:29] !log ebysans@deploy2002 Started deploy [analytics/refinery@1631dea] (thin): Regular analytics weekly train THIN [analytics/refinery@1631dea] [15:56:34] (03CR) 10Dzahn: [C: 03+1] prometheus: Added support for syncing data between instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [15:56:38] !log ebysans@deploy2002 Finished deploy [analytics/refinery@1631dea] (thin): Regular analytics weekly train THIN [analytics/refinery@1631dea] (duration: 00m 08s) [15:57:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add bast2003 DNS entries - pt1979@cumin2002" [15:57:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:58:12] !log ebysans@deploy2002 Started deploy [analytics/refinery@1631dea] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1631dea] [15:58:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host bast2003.mgmt.codfw.wmnet with reboot policy FORCED [15:59:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:42] !log ebysans@deploy2002 Finished deploy [analytics/refinery@1631dea] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1631dea] (duration: 01m 29s) [16:00:05] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:30] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-airflow1006.eqiad.wmnet with reason: host reimage [16:02:01] 10ops-codfw: codfw:sretest2001 for Infrastructure - https://phabricator.wikimedia.org/T320524 (10Papaul) [16:02:15] (03PS1) 10Eevans: Breakfixes from the 2to3 conversion [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/910524 (https://phabricator.wikimedia.org/T334754) [16:02:58] (03CR) 10Eevans: [V: 03+2 C: 03+2] Breakfixes from the 2to3 conversion [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/910524 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [16:03:48] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:04:54] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-airflow1006.eqiad.wmnet with reason: host reimage [16:07:32] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: setting sretest2001 back to offine - pt1979@cumin2002" [16:08:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: setting sretest2001 back to offine - pt1979@cumin2002" [16:08:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:08:35] 10SRE, 
10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for Product Analytics Airflow - https://phabricator.wikimedia.org/T334836 (10Stevemunene) 05Open→03Resolved [16:09:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host bast2003.mgmt.codfw.wmnet with reboot policy FORCED [16:12:46] (03CR) 10Eevans: [C: 03+2] cassandra: add de-init to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [16:15:31] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['bast2003'] [16:15:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['bast2003'] [16:15:56] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['bast2003'] [16:16:07] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['bast2003'] [16:16:13] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['bast2003'] [16:16:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['bast2003'] [16:17:08] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (10Papaul) [16:18:37] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Ccoxwell - https://phabricator.wikimedia.org/T335150 (10CCoxwell-WMF) [16:20:32] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host an-airflow1006.eqiad.wmnet with OS buster [16:22:23] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse2010.codfw.wmnet [16:22:23] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse2010.codfw.wmnet [16:23:20] (03CR) 10Dzahn: [C: 
03+2] "works - shows up here now: https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*gerrit.*%22%7D&g0.tab=1&g0.stacked=" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [16:23:50] !log repooling parse2010 after fix - T335138 [16:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:55] T335138: hw troubleshooting: PSU failure for parse2010.codfw.wmnet - https://phabricator.wikimedia.org/T335138 [16:24:07] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for parse2010.codfw.wmnet - https://phabricator.wikimedia.org/T335138 (10Clement_Goubert) Thanks! [16:25:56] !log Deployed refinery using scap, then deployed onto hdfs as part of weekly deployment train. [16:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:17] 10SRE, 10ops-codfw, 10Traffic: Broken PSU on cp2031 - https://phabricator.wikimedia.org/T335110 (10MoritzMuehlenhoff) Thanks! [16:31:36] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs10[10,13,16,19].eqiad.wmnet: Testing rolling restart (rack1) — T334754 - eevans@cumin1001 [16:31:42] T334754: Move deinitialization from c-foreach-restart to Cassandra's systemd unit - https://phabricator.wikimedia.org/T334754 [16:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:35:21] 10Puppet, 10Infrastructure-Foundations, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10bking) a:03bking [16:39:12] (03CR) 10Cparle: [C: 03+2] structured-data: Add metric alert for section image suggestions. 
[alerts] - 10https://gerrit.wikimedia.org/r/905719 (https://phabricator.wikimedia.org/T328789) (owner: 10Xcollazo) [16:41:29] (03Merged) 10jenkins-bot: structured-data: Add metric alert for section image suggestions. [alerts] - 10https://gerrit.wikimedia.org/r/905719 (https://phabricator.wikimedia.org/T328789) (owner: 10Xcollazo) [16:47:05] (03CR) 10Btullis: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/910460 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [16:47:11] (03CR) 10CI reject: [V: 04-1] Add a custom ceph_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/910460 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [16:51:45] (03PS6) 10Btullis: Add a custom ceph_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/910460 (https://phabricator.wikimedia.org/T330151) [16:52:21] (03CR) 10CI reject: [V: 04-1] Add a custom ceph_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/910460 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [16:53:46] (03PS7) 10Btullis: Add a custom ceph_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/910460 (https://phabricator.wikimedia.org/T330151) [16:54:13] (03PS8) 10Btullis: Add a custom ceph_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/910460 (https://phabricator.wikimedia.org/T330151) [16:57:24] (03CR) 10Dzahn: [C: 03+2] replace gerrit1001 with gerrit1003 as ping target for blackbox smoke [puppet] - 10https://gerrit.wikimedia.org/r/909791 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:58:36] (03CR) 10Dzahn: [C: 03+2] add ServiceOps-Collab as contact for gerrit/phab migration roles and peopleweb [puppet] - 10https://gerrit.wikimedia.org/r/910065 (owner: 10Dzahn) [16:59:06] (03PS1) 10C. 
Scott Ananian: Turn on experimental Parsoid Read Views support, except on commons & wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T55784) [16:59:51] (03CR) 10CI reject: [V: 04-1] Turn on experimental Parsoid Read Views support, except on commons & wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T55784) (owner: 10C. Scott Ananian) [17:00:04] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1700) [17:02:42] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs10[10,13,16,19].eqiad.wmnet: Testing rolling restart (rack1) — T334754 - eevans@cumin1001 [17:02:48] T334754: Move deinitialization from c-foreach-restart to Cassandra's systemd unit - https://phabricator.wikimedia.org/T334754 [17:05:41] (03CR) 10Ssingh: [C: 03+1] acme_chief/gerrit certs: add gerrit1003 to hosts and gerrit-new to SNI [puppet] - 10https://gerrit.wikimedia.org/r/909790 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [17:06:54] (03CR) 10Herron: prometheus: Added support for syncing data between instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [17:08:48] (03PS2) 10C. 
Scott Ananian: Turn on experimental Parsoid Read Views support, except on commons & wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T55784) [17:10:27] (03CR) 10Dzahn: [C: 03+2] acme_chief/gerrit certs: add gerrit1003 to hosts and gerrit-new to SNI [puppet] - 10https://gerrit.wikimedia.org/r/909790 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [17:10:50] (03CR) 10Herron: [C: 03+1] logstash: webrequest ecs: move backend to label [puppet] - 10https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816) (owner: 10Cwhite) [17:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [17:12:59] (03CR) 10CI reject: [V: 04-1] logstash: webrequest ecs: move backend to label [puppet] - 10https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816) (owner: 10Cwhite) [17:14:00] (03PS3) 10C. 
Scott Ananian: Turn on experimental Parsoid Read Views support, except on commons & wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) [17:16:05] (03PS1) 10Papaul: Add bast2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/910558 (https://phabricator.wikimedia.org/T334287) [17:17:23] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for parse2010.codfw.wmnet - https://phabricator.wikimedia.org/T335138 (10Aklapper) [17:17:30] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [17:19:10] (03CR) 10Dzahn: [C: 03+1] prometheus: Added support for syncing data between instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [17:19:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [17:20:11] 10Puppet, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10bking) [17:21:51] (03PS1) 10Zabe: Initial configuration for kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910559 (https://phabricator.wikimedia.org/T334730) [17:22:10] (03CR) 10Papaul: [C: 03+2] Add bast2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/910558 (https://phabricator.wikimedia.org/T334287) (owner: 10Papaul) [17:22:33] (03CR) 10CI reject: [V: 04-1] Initial configuration for kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910559 (https://phabricator.wikimedia.org/T334730) (owner: 10Zabe) [17:22:42] (03PS2) 10Zabe: Initial configuration for kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910559 (https://phabricator.wikimedia.org/T334730) [17:23:00] (03CR) 10Dzahn: "@Legoktm Do you have any opinion about this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [17:23:23] (03CR) 10CI reject: [V: 04-1] Initial configuration for kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910559 (https://phabricator.wikimedia.org/T334730) (owner: 10Zabe) [17:24:10] (03PS3) 10Zabe: Initial configuration for kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910559 (https://phabricator.wikimedia.org/T334730) [17:24:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host bast2003.wikimedia.org with OS bullseye [17:24:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host bast2003.wikimedia.org with OS bullseye [17:25:54] 10SRE, 10Commons, 10MediaWiki-File-management, 10Traffic-Icebox, 10TestMe: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (10Umherirrender) >>! In T228433#8794700, @JoKalliauer wrote: > @Umherirrender ; You have to compare the PNG not the S... 
[17:27:39] (03PS4) 10Zabe: Initial configuration for kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910559 (https://phabricator.wikimedia.org/T334730) [17:28:43] jouncebot: nowandnext [17:28:43] For the next 0 hour(s) and 31 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1700) [17:28:43] For the next 0 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1700) [17:28:43] In 0 hour(s) and 31 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1800) [17:28:50] (03PS2) 10Dzahn: mariadb::generic_server: change default datadir path [puppet] - 10https://gerrit.wikimedia.org/r/909788 (https://phabricator.wikimedia.org/T329571) [17:29:14] (03PS4) 10Subramanya Sastry: Turn on experimental Parsoid Read Views support, except on commons & wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: 10C. Scott Ananian) [17:29:18] (03PS2) 10Dzahn: phorge: add parameter for db_datadir and use default path [puppet] - 10https://gerrit.wikimedia.org/r/909787 (https://phabricator.wikimedia.org/T329571) [17:30:00] (03PS1) 10Zabe: Add messages for Fante Wikipedia (fatwiki) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910494 (https://phabricator.wikimedia.org/T335016) [17:30:52] (03PS1) 10Zabe: Localisation updates from https://translatewiki.net. [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910495 [17:31:04] (03PS1) 10Zabe: Localisation updates from https://translatewiki.net. 
[extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910496 [17:31:33] (03PS6) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [17:32:09] (03PS2) 10Zabe: Localisation updates from https://translatewiki.net. [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910496 [17:32:16] (03PS2) 10Zabe: Localisation updates from https://translatewiki.net. [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910495 [17:32:23] (03CR) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [17:33:34] (03PS7) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [17:34:08] (03CR) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [17:34:11] (03CR) 10Zabe: [C: 03+2] Add messages for Fante Wikipedia (fatwiki) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910494 (https://phabricator.wikimedia.org/T335016) (owner: 10Zabe) [17:34:25] (03CR) 10Zabe: [C: 03+2] Localisation updates from https://translatewiki.net. [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910496 (owner: 10Zabe) [17:34:29] (03CR) 10Zabe: [C: 03+2] Localisation updates from https://translatewiki.net. [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910495 (owner: 10Zabe) [17:35:36] (03CR) 10Dzahn: "yea, so.. 
since the quickdatacopy class needs to be on the source host (or on all hosts), it's probably true that the profile should be in" [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [17:37:21] (03CR) 10Dzahn: "then again.. "source" only means "wherever the rsyncd runs" and you can still either pull from or push to the rsyncd.. that's why source/d" [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [17:37:23] (03CR) 10Zabe: [C: 03+2] Initial configuration for kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910559 (https://phabricator.wikimedia.org/T334730) (owner: 10Zabe) [17:37:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [17:38:12] (03Merged) 10jenkins-bot: Initial configuration for kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910559 (https://phabricator.wikimedia.org/T334730) (owner: 10Zabe) [17:38:44] 10SRE, 10Commons, 10MediaWiki-File-management, 10Traffic-Icebox, 10TestMe: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (10JoKalliauer) 05Open→03Resolved p:05Medium→03Lowest a:03JoKalliauer I think it is diffciult to reproduce i... 
[17:39:11] (03CR) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [17:39:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast2003.wikimedia.org with reason: host reimage [17:39:46] !log create Wiktionary Tyap # T334730 [17:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:51] T334730: Create Wiktionary Tyap - https://phabricator.wikimedia.org/T334730 [17:39:57] !log zabe@deploy2002 Started scap: create kcgwiktionary (T334730) [17:40:42] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [17:41:23] !log zabe@deploy2002 zabe: create kcgwiktionary (T334730) synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [17:42:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2003.wikimedia.org with reason: host reimage [17:45:55] (03PS8) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [17:46:38] (03CR) 10Krinkle: [C: 03+1] "LGTM. Needs good testing on mwdebug for affected wikis both as main and as external DB. 
[[m:Special:CentralAuth]] might be a good way to t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909763 (owner: 10Aaron Schulz) [17:48:06] !log zabe@deploy2002 Finished scap: create kcgwiktionary (T334730) (duration: 08m 08s) [17:48:11] T334730: Create Wiktionary Tyap - https://phabricator.wikimedia.org/T334730 [17:49:12] (03PS1) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910563 (https://phabricator.wikimedia.org/T321309) [17:49:36] (03CR) 10CI reject: [V: 04-1] hiera: lvs/balancer: unify hiera post bullseye upgrade (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910563 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:49:45] rude [17:49:59] (03Merged) 10jenkins-bot: Add messages for Fante Wikipedia (fatwiki) [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910494 (https://phabricator.wikimedia.org/T335016) (owner: 10Zabe) [17:50:02] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910496 (owner: 10Zabe) [17:51:03] (03PS2) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910563 (https://phabricator.wikimedia.org/T321309) [17:52:24] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. 
[extensions/WikimediaMessages] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910495 (owner: 10Zabe) [17:52:46] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40788/console" [puppet] - 10https://gerrit.wikimedia.org/r/910563 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:53:05] !log zabe@deploy2002 Started scap: Backport for [[gerrit:910494|Add messages for Fante Wikipedia (fatwiki) (T335016)]], [[gerrit:910496|Localisation updates from https://translatewiki.net.]], [[gerrit:910495|Localisation updates from https://translatewiki.net.]] [17:53:11] T335016: Create Wikipedia Fante - https://phabricator.wikimedia.org/T335016 [17:54:01] !log disable puppet in A:lvs and A:eqiad to test CR 910563 [17:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:03] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: lvs/balancer: unify hiera post bullseye upgrade (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910563 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:57:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [17:58:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [17:59:17] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:00:04] jnuche and ^demon: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T1800). Please do the needful. 
[18:00:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:00:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast2003.wikimedia.org with OS bullseye [18:00:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host bast2003.wikimedia.org with OS bullseye completed: - bast2003 (**PASS**)... [18:01:20] !log enable puppet and run agent in A:lvs and A:eqiad CR 910563 [18:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:42] 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10wiki_willy) a:03Jclark-ctr [18:02:00] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T334964 (10wiki_willy) a:03Jclark-ctr [18:03:22] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10wiki_willy) a:03RobH [18:03:47] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10wiki_willy) a:03RobH [18:04:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:04:28] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10wiki_willy) a:03RobH [18:05:10] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (10Papaul) [18:05:11] !log zabe@deploy2002 zabe: Backport for [[gerrit:910494|Add messages for Fante Wikipedia (fatwiki) (T335016)]], 
[[gerrit:910496|Localisation updates from https://translatewiki.net.]], [[gerrit:910495|Localisation updates from https://translatewiki.net.]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [18:05:17] T335016: Create Wikipedia Fante - https://phabricator.wikimedia.org/T335016 [18:05:45] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (10Papaul) 05Open→03Resolved a:03Papaul @MoritzMuehlenhoff all yours [18:07:07] 10ops-codfw: codfw:sretest2001 for Infrastructure - https://phabricator.wikimedia.org/T320524 (10Papaul) 05Open→03Resolved Server has be put back to decom inventory [18:08:11] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/909738/40757/" [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [18:10:52] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab1003.wikimedia.org with OS bullseye [18:11:04] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10wiki_willy) a:03RobH [18:11:10] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10RobH) This currently shows online via icinga and test ping [18:11:22] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10RobH) 05Open→03Resolved [18:11:35] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334783 (10wiki_willy) a:03RobH [18:16:42] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10RobH) 05Resolved→03Open {F36957729} {F36957730} [18:17:04] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:910494|Add messages for Fante Wikipedia (fatwiki) (T335016)]], [[gerrit:910496|Localisation updates from https://translatewiki.net.]], [[gerrit:910495|Localisation updates from https://translatewiki.net.]] (duration: 
23m 58s) [18:17:10] T335016: Create Wikipedia Fante - https://phabricator.wikimedia.org/T335016 [18:17:17] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10RobH) [18:17:32] (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:20:15] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10RobH) 05Open→03Resolved asw1-eqsin.mgmt is also online ` robh@~$ ssh asw1-eqsin.mgmt.eqsin.wmnet --- JUNOS 20.3R2-S1.2 built 2021-05-20 14:12:14 UTC {master:0} robh@asw1-eqsin `> [18:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:20:57] (03CR) 10Herron: prometheus: Add support for syncing data between Prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [18:21:27] (03CR) 10Dzahn: [C: 03+2] phorge: add parameter for db_datadir and use default path [puppet] - 10https://gerrit.wikimedia.org/r/909787 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [18:21:33] (03PS3) 10Dzahn: phorge: add parameter for db_datadir and use default path [puppet] - 10https://gerrit.wikimedia.org/r/909787 (https://phabricator.wikimedia.org/T329571) [18:22:11] (03PS2) 10Dzahn: phabricator: add parameter for db_datadir in cloud and use default path [puppet] - 10https://gerrit.wikimedia.org/r/909786 (https://phabricator.wikimedia.org/T329571) [18:22:32] (JobUnavailable) resolved: (5) Reduced 
availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:23:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) @Jclark-ctr Thanks! I can verify that worked and I have been able to start building out the hosts. [18:24:07] (03PS1) 10Zabe: Initial configuration for fatwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910565 (https://phabricator.wikimedia.org/T335016) [18:25:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) a:05Cmjohnson→03Dwisehaupt [18:25:33] (03CR) 10Zabe: [C: 03+2] Initial configuration for fatwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910565 (https://phabricator.wikimedia.org/T335016) (owner: 10Zabe) [18:26:18] !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage [18:27:18] (03PS1) 10Ssingh: pybal/lvs: remove backward compatibility for buster [puppet] - 10https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) [18:28:02] (03Merged) 10jenkins-bot: Initial configuration for fatwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910565 (https://phabricator.wikimedia.org/T335016) (owner: 10Zabe) [18:28:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40789/console" [puppet] - 10https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:29:07] !log create Wikipedia Fante # T335016 [18:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:13] 
T335016: Create Wikipedia Fante - https://phabricator.wikimedia.org/T335016 [18:29:29] !log zabe@deploy2002 Started scap: T335016 [18:29:35] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage [18:30:28] (03CR) 10Jforrester: "🎉" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: 10C. Scott Ananian) [18:30:48] !log zabe@deploy2002 zabe: T335016 synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [18:34:13] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:36:58] !log zabe@deploy2002 Finished scap: T335016 (duration: 07m 28s) [18:37:03] T335016: Create Wikipedia Fante - https://phabricator.wikimedia.org/T335016 [18:37:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:42:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:44:24] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910528 [18:44:26] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910528 (owner: 10Zabe) [18:45:20] (03PS1) 10Bking: wdqs: use newer java profile for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/910568 (https://phabricator.wikimedia.org/T331300) [18:45:35] (03PS1) 10Zabe: Disable VE as default editor on kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910569 
(https://phabricator.wikimedia.org/T334730) [18:45:49] (03CR) 10CI reject: [V: 04-1] wdqs: use newer java profile for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/910568 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [18:45:56] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/910568 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [18:46:07] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/910568 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [18:46:33] (03CR) 10Zabe: [C: 03+2] Disable VE as default editor on kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910569 (https://phabricator.wikimedia.org/T334730) (owner: 10Zabe) [18:47:09] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:47:23] (03PS2) 10Bking: wdqs: use newer java profile for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/910568 (https://phabricator.wikimedia.org/T331300) [18:47:26] 10Puppet, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10Gehel) [18:47:37] (03CR) 10Ladsgroup: [C: 03+1] "don't worry about the merge conflict." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: 10C. 
Scott Ananian) [18:48:14] (03PS9) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [18:48:37] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/910568 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [18:48:43] (03PS1) 10Zabe: db-production: Fix indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910570 [18:49:03] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40790/console" [puppet] - 10https://gerrit.wikimedia.org/r/910568 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [18:49:06] (03CR) 10Zabe: [C: 03+2] db-production: Fix indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910570 (owner: 10Zabe) [18:49:49] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910528 (owner: 10Zabe) [18:49:51] (03CR) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [18:49:53] (03Merged) 10jenkins-bot: Disable VE as default editor on kcgwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910569 (https://phabricator.wikimedia.org/T334730) (owner: 10Zabe) [18:50:00] (03Merged) 10jenkins-bot: db-production: Fix indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910570 (owner: 10Zabe) [18:50:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host backup2010.codfw.wmnet with OS bullseye [18:50:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
pt1979@cumin2002 for host backup2010.codfw.wmnet with OS bullseye [18:50:33] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add backup2011 DNS entries - pt1979@cumin2002" [18:50:33] (03CR) 10Bking: [C: 03+2] wdqs: use newer java profile for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/910568 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [18:50:37] (03CR) 10Gehel: [C: 03+1] wdqs: use newer java profile for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/910568 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [18:50:52] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1003.wikimedia.org with OS bullseye [18:51:25] !log zabe@deploy2002 Started scap: Backport for [[gerrit:910569|Disable VE as default editor on kcgwiktionary (T334730)]], [[gerrit:910570|db-production: Fix indentation]], [[gerrit:910528|Update interwiki cache]] [18:51:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add backup2011 DNS entries - pt1979@cumin2002" [18:51:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:51:29] T334730: Create Wiktionary Tyap - https://phabricator.wikimedia.org/T334730 [18:52:39] !log zabe@deploy2002 zabe: Backport for [[gerrit:910569|Disable VE as default editor on kcgwiktionary (T334730)]], [[gerrit:910570|db-production: Fix indentation]], [[gerrit:910528|Update interwiki cache]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [18:53:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host backup2011.mgmt.codfw.wmnet with reboot policy FORCED [18:54:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - 
https://phabricator.wikimedia.org/T326965 (10Papaul) [18:56:51] (03CR) 10Dzahn: [C: 03+2] "cloud-VPS only" [puppet] - 10https://gerrit.wikimedia.org/r/909787 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [18:56:54] (03PS1) 10Bking: wdqs: correct profile logic [puppet] - 10https://gerrit.wikimedia.org/r/910571 (https://phabricator.wikimedia.org/T331300) [18:57:15] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/910571 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [18:58:31] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:910569|Disable VE as default editor on kcgwiktionary (T334730)]], [[gerrit:910570|db-production: Fix indentation]], [[gerrit:910528|Update interwiki cache]] (duration: 07m 06s) [18:58:36] T334730: Create Wiktionary Tyap - https://phabricator.wikimedia.org/T334730 [18:59:23] (03CR) 10Subramanya Sastry: "Have to wait for the train to go out next week with Parosid changes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: 10C. 
Scott Ananian) [18:59:37] (03CR) 10Bking: [C: 03+2] wdqs: correct profile logic [puppet] - 10https://gerrit.wikimedia.org/r/910571 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:00:07] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:58] (03CR) 10Dzahn: [C: 03+2] "noop on phorge-1001" [puppet] - 10https://gerrit.wikimedia.org/r/909787 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [19:03:32] (03PS1) 10Bking: wdqs: use proper YAML variable type [puppet] - 10https://gerrit.wikimedia.org/r/910572 (https://phabricator.wikimedia.org/T331300) [19:05:42] (03PS2) 10Bking: wdqs: use proper YAML variable type [puppet] - 10https://gerrit.wikimedia.org/r/910572 (https://phabricator.wikimedia.org/T331300) [19:05:58] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: use proper YAML variable type [puppet] - 10https://gerrit.wikimedia.org/r/910572 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:07:11] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40792/console" [puppet] - 10https://gerrit.wikimedia.org/r/910572 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:07:24] (03CR) 10Bking: [C: 03+2] wdqs: use proper YAML variable type [puppet] - 10https://gerrit.wikimedia.org/r/910572 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:08:20] (03CR) 10Dzahn: [C: 04-1] "lookup in wrong profile" [puppet] - 10https://gerrit.wikimedia.org/r/909786 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [19:10:02] (03CR) 10Andrea Denisse: "PCC results: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40791/console" [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [19:13:42] 
10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Ccoxwell - https://phabricator.wikimedia.org/T335150 (10CCoxwell-WMF) [19:13:46] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:14:27] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Ccoxwell - https://phabricator.wikimedia.org/T335150 (10dr0ptp4kt) Approved for ldap/wmf access for Carrie. [19:15:48] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:16:17] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:16:40] !log bking@cumin1001 depool wdqs2012.codfw.wmnet for data xfer T331300 [19:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:45] T331300: Ensure WCQS/WDQS stack works on Bullseye and later - https://phabricator.wikimedia.org/T331300 [19:17:10] (03CR) 10Dzahn: "compiler output looks like the class is applied but nothing happens. I think this is because your hieradata is still in role/common/promet" [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [19:21:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:23:54] (03PS1) 10Zabe: Initial configuration for kbdwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910575 (https://phabricator.wikimedia.org/T333266) [19:25:28] (03CR) 10Zabe: [C: 03+2] Initial configuration for kbdwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910575 (https://phabricator.wikimedia.org/T333266) (owner: 10Zabe) [19:26:17] (03Merged) 10jenkins-bot: Initial configuration for kbdwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910575 (https://phabricator.wikimedia.org/T333266) (owner: 10Zabe) [19:27:25] !log create 
Wiktionary Kabardian # T333266 [19:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:31] T333266: Create Wiktionary Kabardian - https://phabricator.wikimedia.org/T333266 [19:27:45] !log zabe@deploy2002 Started scap: T333266 [19:28:01] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:29:06] !log zabe@deploy2002 zabe: T333266 synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [19:30:37] (03PS10) 10Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [19:31:00] (03PS1) 10Papaul: Add backup201[0-1] to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/910576 (https://phabricator.wikimedia.org/T326965) [19:31:55] (03CR) 10Papaul: [C: 03+2] Add backup201[0-1] to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/910576 (https://phabricator.wikimedia.org/T326965) (owner: 10Papaul) [19:32:31] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:33] TheresNoTime: are you planning to do T334394 ? 
[19:33:35] T334394: Create Wikinews Gungbe - https://phabricator.wikimedia.org/T334394 [19:34:50] !log zabe@deploy2002 Finished scap: T333266 (duration: 07m 04s) [19:34:56] T333266: Create Wiktionary Kabardian - https://phabricator.wikimedia.org/T333266 [19:36:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:37:29] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:39:40] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910529 [19:39:42] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910529 (owner: 10Zabe) [19:40:26] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910529 (owner: 10Zabe) [19:40:41] !log zabe@deploy2002 Started scap: Backport for [[gerrit:910529|Update interwiki cache]] [19:41:58] !log zabe@deploy2002 zabe: Backport for [[gerrit:910529|Update interwiki cache]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [19:47:29] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:910529|Update interwiki cache]] (duration: 06m 47s) [19:49:52] (03PS1) 10Nray: Fix TypeError: trigger.attr is not a function [skins/Vector] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910499 (https://phabricator.wikimedia.org/T335148) [19:54:19] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:57:41] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:58:18] !log bking@cumin1001 START - Cookbook 
sre.wdqs.data-transfer [20:00:05] brennen and TheresNoTime: Dear deployers, time to do the UTC late backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230420T2000). [20:00:05] nray: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] o/ I'm here [20:01:19] hey nray I can deploy [20:01:33] hi thcipriani , thank you for your help! [20:01:43] ^^ [20:04:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/910499 (https://phabricator.wikimedia.org/T335148) (owner: 10Nray) [20:06:57] PROBLEM - Check systemd state on cloudbackup2001 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-tools-project.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:07] (03CR) 10Dzahn: prometheus: Add support for syncing data between Prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [20:15:14] (03PS1) 10Eevans: sessionstore: disable sessionstore1001 native transport [puppet] - 10https://gerrit.wikimedia.org/r/910584 (https://phabricator.wikimedia.org/T334754) [20:16:19] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/910584 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [20:19:33] (03CR) 10Eevans: [C: 03+2] sessionstore: disable sessionstore1001 native transport [puppet] - 10https://gerrit.wikimedia.org/r/910584 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [20:21:01] (03Merged) 10jenkins-bot: Fix TypeError: trigger.attr is not a function [skins/Vector] (wmf/1.41.0-wmf.5) - 
10https://gerrit.wikimedia.org/r/910499 (https://phabricator.wikimedia.org/T335148) (owner: 10Nray) [20:21:17] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:910499|Fix TypeError: trigger.attr is not a function (T335148)]] [20:21:24] T335148: TypeError: trigger.attr is not a function - https://phabricator.wikimedia.org/T335148 [20:22:00] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10aaron) Lag that is only in the secondary DC doe... [20:22:12] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [20:22:41] !log thcipriani@deploy2002 nray and thcipriani: Backport for [[gerrit:910499|Fix TypeError: trigger.attr is not a function (T335148)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:22:56] (03PS3) 10Dzahn: phabricator: add parameter for db_datadir in cloud and use default path [puppet] - 10https://gerrit.wikimedia.org/r/909786 (https://phabricator.wikimedia.org/T329571) [20:23:11] ^ nray your change is live on mwdebug, check please! [20:23:28] thcipriani: yes, reviewing now [20:23:37] <3 [20:24:36] @thcipriani looks good, you can proceed! [20:25:26] cool, going live [20:26:58] (03CR) 10BryanDavis: toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909397 (https://phabricator.wikimedia.org/T334925) (owner: 10BryanDavis) [20:31:10] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:910499|Fix TypeError: trigger.attr is not a function (T335148)]] (duration: 09m 53s) [20:31:17] T335148: TypeError: trigger.attr is not a function - https://phabricator.wikimedia.org/T335148 [20:31:19] ^ nray live everywhere! 
[20:31:28] thank you for your help thcipriani ! [20:31:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:32:19] sure thing :) [20:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:33:42] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host sessionstore1001.eqiad.wmnet [20:34:55] (03PS1) 10Papaul: Remove backup201[0-1] it is already at line backup999 [puppet] - 10https://gerrit.wikimedia.org/r/910588 (https://phabricator.wikimedia.org/T326965) [20:35:15] PROBLEM - cassandra-a CQL 10.64.0.144:9042 on sessionstore1001 is CRITICAL: connect to address 10.64.0.144 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [20:36:00] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.144:9042 on sessionstore1001 is CRITICAL: connect to address 10.64.0.144 and port 9042: Connection refused eevans Testing in-progress https://phabricator.wikimedia.org/T93886 [20:36:32] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:36:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:37:23] (03CR) 10Papaul: [C: 03+2] Remove backup201[0-1] it is already at line backup999 [puppet] - 10https://gerrit.wikimedia.org/r/910588 (https://phabricator.wikimedia.org/T326965) (owner: 10Papaul) [20:37:46] (03CR) 10Dzahn: [V: 
03+1 C: 03+2] "cloud-only - https://puppet-compiler.wmflabs.org/output/909738/40795/" [puppet] - 10https://gerrit.wikimedia.org/r/909786 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [20:38:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. feel free to merge any time, it should be safe." [puppet] - 10https://gerrit.wikimedia.org/r/909794 (owner: 10Dzahn) [20:39:09] papaul: yes:) [20:40:48] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop in production" [puppet] - 10https://gerrit.wikimedia.org/r/909786 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [20:42:03] (03CR) 10Dzahn: [C: 03+2] "thank you" [puppet] - 10https://gerrit.wikimedia.org/r/909794 (owner: 10Dzahn) [20:46:53] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "on phabricator-prod-1001.devtools - confirmed it changed the datadir in /etc/my.cnf. then added override in Hiera in Horizon to change it " [puppet] - 10https://gerrit.wikimedia.org/r/909786 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [20:47:30] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [20:49:16] (03PS3) 10Ladsgroup: auto_schema: Get rid of concept of skipping replicas [software] - 10https://gerrit.wikimedia.org/r/910057 [20:49:19] (03CR) 10Ladsgroup: auto_schema: Get rid of concept of skipping replicas [software] - 10https://gerrit.wikimedia.org/r/910057 (owner: 10Ladsgroup) [20:49:50] (03Merged) 10jenkins-bot: auto_schema: Get rid of concept of skipping replicas [software] - 10https://gerrit.wikimedia.org/r/910057 (owner: 10Ladsgroup) [20:54:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [20:57:11] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [20:58:54] (03PS2) 10Cmelo: Add the campaignevents-organize-events right to the campaignevents-beta-tester group, and remove it from the user group in the metawiki config [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) [21:03:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2011.mgmt.codfw.wmnet with reboot policy FORCED [21:03:42] (03CR) 10Cmelo: Add the campaignevents-organize-events right to the campaignevents-beta-tester group, and remove it from the user group in the metawiki conf (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [21:03:55] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [21:04:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, should be safe to merge anytime." [puppet] - 10https://gerrit.wikimedia.org/r/909795 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [21:05:21] (03CR) 10Dzahn: [C: 03+2] cloudgw: allow VMs to speak to new gerrit server (gerrit1003) [puppet] - 10https://gerrit.wikimedia.org/r/909795 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [21:05:42] (03PS2) 10Cmelo: Metawiki: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) [21:05:45] (03CR) 10Dzahn: [C: 03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/909795 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [21:08:40] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10sgrabarczuk) [21:11:11] (03PS1) 10Ladsgroup: mariadb: Add lists1003 grants for mailman dbs [puppet] - 10https://gerrit.wikimedia.org/r/910598 (https://phabricator.wikimedia.org/T331706) [21:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [21:12:22] (03PS3) 10Cmelo: Add new user right campaignevents-organize-events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) [21:16:07] (03PS3) 10Cmelo: Metawiki: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) [21:16:40] (03PS5) 10Zabe: Initial configuration for guwwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907506 (https://phabricator.wikimedia.org/T334394) (owner: 10Samtar) [21:18:27] !log bking@cumin1001 depool wdqs2009 for data xfer T331300 [21:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:33] T331300: Ensure WCQS/WDQS stack works on Bullseye and later - https://phabricator.wikimedia.org/T331300 [21:18:38] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:19:02] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:19:17] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:21:05] 
10SRE-swift-storage, 10MediaWiki-File-management, 10MW-1.41-notes (1.41.0-wmf.4; 2023-04-10), 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10putnik) @Ladsgroup Could you check https://commons.wikimedia.org/wiki/File:Wikitext-ru.svg ? -... [21:22:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [21:22:44] !log bking@cumin1001 repool wdqs2012 T331300 [21:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:56] (03PS1) 10Cmelo: Metawiki: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910599 (https://phabricator.wikimedia.org/T334088) [21:24:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [21:25:10] (03CR) 10Zabe: [C: 03+2] Initial configuration for guwwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907506 (https://phabricator.wikimedia.org/T334394) (owner: 10Samtar) [21:25:37] (03CR) 10Cmelo: Metawiki: Enable $wgCampaignEventsEnableMultipleOrganizers in production (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [21:26:15] (03Merged) 10jenkins-bot: Initial configuration for guwwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907506 (https://phabricator.wikimedia.org/T334394) (owner: 10Samtar) [21:26:56] !log create Wikinews Gungbe # T334394 [21:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:02] T334394: Create Wikinews Gungbe - https://phabricator.wikimedia.org/T334394 [21:27:19] !log zabe@deploy2002 Started scap: T334394 [21:28:05] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [21:28:35] !log zabe@deploy2002 zabe: T334394 synced to the testservers: mwdebug1002.eqiad.wmnet, 
mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:35:06] !log zabe@deploy2002 Finished scap: T334394 (duration: 07m 46s) [21:35:12] T334394: Create Wikinews Gungbe - https://phabricator.wikimedia.org/T334394 [21:35:24] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [21:35:42] (03PS1) 10Eevans: Revert "sessionstore: disable sessionstore1001 native transport" [puppet] - 10https://gerrit.wikimedia.org/r/910500 [21:36:25] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:27] (03CR) 10Eevans: [V: 03+2 C: 03+2] Revert "sessionstore: disable sessionstore1001 native transport" [puppet] - 10https://gerrit.wikimedia.org/r/910500 (owner: 10Eevans) [21:37:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:39:17] RECOVERY - cassandra-a CQL 10.64.0.144:9042 on sessionstore1001 is OK: TCP OK - 0.032 second response time on 10.64.0.144 port 9042 https://phabricator.wikimedia.org/T93886 [21:39:38] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910530 [21:39:40] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910530 (owner: 10Zabe) [21:40:42] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910530 (owner: 10Zabe) [21:41:00] !log zabe@deploy2002 Started scap: Backport for [[gerrit:910530|Update interwiki cache]] [21:42:12] !log zabe@deploy2002 zabe: Backport for [[gerrit:910530|Update interwiki cache]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, 
mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:42:17] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [21:45:17] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [21:45:21] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [21:47:26] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:910530|Update interwiki cache]] (duration: 06m 26s) [21:48:48] (03CR) 10Legoktm: "Ooh TIL about BindsTo. Yep, makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [21:48:57] (03PS2) 10Legoktm: codesearch: Change systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [21:50:29] (03CR) 10Legoktm: [C: 03+2] codesearch: Change systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [22:02:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [22:03:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [22:04:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:16:57] (03CR) 10Daimona Eaytoy: Metawiki: Enable $wgCampaignEventsEnableMultipleOrganizers in production (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [22:17:10] (03CR) 10Bartosz Dziewoński: ":o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: 10C. 
Scott Ananian) [22:17:41] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2011'] [22:18:03] (03CR) 10Daimona Eaytoy: [C: 04-1] "Patch is the same as PS1, there might have been some issue with gerrit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [22:18:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup2011'] [22:18:26] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2011'] [22:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:21:43] (03PS4) 10Stevemunene: Configure product analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) [22:24:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup2011'] [22:32:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (10Papaul) [22:32:43] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:46:55] (03PS1) 10Superpes15: [kcgwiktionary] Add a HD logo for vector legacy 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/910603 (https://phabricator.wikimedia.org/T335162) [22:48:44] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [22:58:10] (03PS1) 10Superpes15: [guwwikinews] Add a HD logo for vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910604 (https://phabricator.wikimedia.org/T335162) [22:59:14] (03PS2) 10Superpes15: [kcgwiktionary] Add a HD logo for vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910603 (https://phabricator.wikimedia.org/T335162) [23:42:52] (03PS1) 10Papaul: Fix netboot.cfg to match backup201[0-1] [puppet] - 10https://gerrit.wikimedia.org/r/910606 (https://phabricator.wikimedia.org/T326965) [23:43:13] (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910604 (https://phabricator.wikimedia.org/T335162) (owner: 10Superpes15) [23:44:15] (03CR) 10Papaul: [C: 03+2] Fix netboot.cfg to match backup201[0-1] [puppet] - 10https://gerrit.wikimedia.org/r/910606 (https://phabricator.wikimedia.org/T326965) (owner: 10Papaul) [23:45:17] (03CR) 10Dzahn: [C: 03+1] Fix netboot.cfg to match backup201[0-1] [puppet] - 10https://gerrit.wikimedia.org/r/910606 (https://phabricator.wikimedia.org/T326965) (owner: 10Papaul)