[00:06:06] (03CR) 10Krinkle: [C: 03+2] Profiler: Implement "Excimer UI" option for WikimediaDebug (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [00:06:50] (03Merged) 10jenkins-bot: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [00:07:48] !log krinkle@deploy1002 Synchronized lib/: I4cfa4a2474b4e (duration: 06m 51s) [00:08:58] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:12:17] (03PS1) 10Stang: plwiki: Show language selector in main page header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920396 (https://phabricator.wikimedia.org/T336707) [00:14:20] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:15:26] !log krinkle@deploy1002 Synchronized wmf-config/: I4cfa4a2474b4e (duration: 06m 14s) [00:21:46] !log krinkle@deploy1002 Synchronized src/: I4cfa4a2474b4e (duration: 06m 01s) [00:26:41] ACKNOWLEDGEMENT - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T336826 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:26:45] 10SRE, 10ops-eqiad: Degraded RAID on analytics1068 - https://phabricator.wikimedia.org/T336826 (10ops-monitoring-bot) [00:39:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/920330 [00:39:42] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
10https://gerrit.wikimedia.org/r/920330 (owner: 10TrainBranchBot) [00:56:56] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/920330 (owner: 10TrainBranchBot) [01:00:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:06:32] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:02] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:44:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:45:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:03:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:07:12] (03PS1) 10Mvolz: Update zotero to use new translators fork [deployment-charts] - 10https://gerrit.wikimedia.org/r/920404 (https://phabricator.wikimedia.org/T336727) [05:12:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:12:46] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:16:04] !log Optimize s7 on dbstore1003 T336733 [05:16:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:08] T336733: dbstore1003 filling up - https://phabricator.wikimedia.org/T336733 [05:18:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:18:36] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:56] (03PS1) 10Marostegui: db1112: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/920564 (https://phabricator.wikimedia.org/T336332) [05:19:37] (03CR) 10Marostegui: [C: 03+2] db1112: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/920564 (https://phabricator.wikimedia.org/T336332) (owner: 10Marostegui) [05:20:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1112 from dbctl T336332', diff saved to https://phabricator.wikimedia.org/P48263 and previous config saved to /var/cache/conftool/dbconfig/20230517-052007-marostegui.json [05:20:11] T336332: decommission db1112.eqiad.wmnet - https://phabricator.wikimedia.org/T336332 [05:33:14] (03PS1) 10KartikMistry: Update MinT to 2023-05-17-052844-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 [05:35:51] (03CR) 10Santhosh: [C: 03+1] Update MinT to 2023-05-17-052844-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 (owner: 10KartikMistry) [05:40:37] (03PS1) 10Marostegui: mariadb: Decommission db1115 [puppet] - 10https://gerrit.wikimedia.org/r/920586 (https://phabricator.wikimedia.org/T336253) [05:41:48] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.decommission for hosts db1115.eqiad.wmnet [05:46:52] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [05:47:24] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1115 [puppet] - 10https://gerrit.wikimedia.org/r/920586 (https://phabricator.wikimedia.org/T336253) (owner: 10Marostegui) [05:48:42] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1115.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [05:49:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1115.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [05:49:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:49:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1115.eqiad.wmnet [05:50:22] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1115.eqiad.wmnet - https://phabricator.wikimedia.org/T336253 (10Marostegui) 05Stalled→03Open a:05Marostegui→03Jclark-ctr [05:50:45] 10ops-eqiad, 10decommission-hardware: decommission db1115.eqiad.wmnet - https://phabricator.wikimedia.org/T336253 (10Marostegui) [05:50:47] 10ops-eqiad, 10decommission-hardware: decommission db1115.eqiad.wmnet - https://phabricator.wikimedia.org/T336253 (10Marostegui) Ready for DCOPs [05:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:53:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2096', diff saved to 
https://phabricator.wikimedia.org/P48264 and previous config saved to /var/cache/conftool/dbconfig/20230517-055310-root.json [05:55:12] (03CR) 10Volans: [C: 03+2] install_server: fix ztp-juniper script [puppet] - 10https://gerrit.wikimedia.org/r/920374 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [05:59:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48265 and previous config saved to /var/cache/conftool/dbconfig/20230517-055904-root.json [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T0600) [06:00:19] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1047 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:01:48] !log restarted ferm on ms-be1047 [06:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:28] Does anyone mind if I deploy in this window? 
(zotero/citoid) [06:08:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:08:24] jouncebot: now [06:08:24] For the next 0 hour(s) and 51 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T0600) [06:08:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:10:43] (03CR) 10Volans: [C: 03+1] "LGTM, try to test in on -next if possible" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [06:14:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48267 and previous config saved to /var/cache/conftool/dbconfig/20230517-061409-root.json [06:14:20] (03CR) 10Mvolz: [C: 03+2] Update zotero to use new translators fork [deployment-charts] - 10https://gerrit.wikimedia.org/r/920404 (https://phabricator.wikimedia.org/T336727) (owner: 10Mvolz) [06:15:28] (03Merged) 10jenkins-bot: Update zotero to use new translators fork [deployment-charts] - 10https://gerrit.wikimedia.org/r/920404 (https://phabricator.wikimedia.org/T336727) (owner: 10Mvolz) [06:18:05] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [06:19:17] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [06:20:22] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [06:20:27] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [06:20:46] !log mvolz@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/zotero: apply [06:21:35] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [06:22:00] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [06:23:03] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [06:24:58] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/908882 (owner: 10PipelineBot) [06:25:48] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/908882 (owner: 10PipelineBot) [06:29:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48268 and previous config saved to /var/cache/conftool/dbconfig/20230517-062914-root.json [06:30:51] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1047 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:37:23] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [06:37:51] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [06:38:40] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [06:39:18] (03PS2) 10KartikMistry: Update MinT to 2023-05-17-052844-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 [06:39:18] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [06:39:52] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [06:40:19] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [06:42:05] (03PS1) 10Marostegui: db1132: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/920592 (https://phabricator.wikimedia.org/T335632) [06:42:25] 10SRE, 10Domains, 10Traffic: Mark 
Monitor administration panel (redirects for wikimedia.pl) - https://phabricator.wikimedia.org/T333827 (10Jacek_Broda_WMPL) Hi! Thanks very much, Dzahn! I will try to contact Legal team and ask them about it. Hope that will work. Best for you! [06:42:27] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:42:41] (03CR) 10Marostegui: [C: 03+2] db1132: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/920592 (https://phabricator.wikimedia.org/T335632) (owner: 10Marostegui) [06:43:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: Repooling after a crash', diff saved to https://phabricator.wikimedia.org/P48269 and previous config saved to /var/cache/conftool/dbconfig/20230517-064313-root.json [06:43:37] (03PS7) 10TChin: Add flink-app default log config and use it in page_content_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) [06:43:44] (03CR) 10Marostegui: [C: 03+2] production-m5.sql: Add ipoid grants [puppet] - 10https://gerrit.wikimedia.org/r/920194 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [06:44:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48270 and previous config saved to /var/cache/conftool/dbconfig/20230517-064419-root.json [06:45:04] (03PS3) 10KartikMistry: Update MinT to 2023-05-17-052844-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 [06:48:41] (03PS4) 10KartikMistry: Update MinT to 2023-05-17-052844-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 [06:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:58:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 2%: Repooling after a crash', diff saved to https://phabricator.wikimedia.org/P48271 and previous config saved to /var/cache/conftool/dbconfig/20230517-065817-root.json [06:59:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48272 and previous config saved to /var/cache/conftool/dbconfig/20230517-065923-root.json [07:00:04] Amir1, Urbanecm, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:14] * kart_ is here [07:01:22] (03CR) 10Slyngshede: [C: 03+2] Reconnect handling reworked. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/919273 (owner: 10Slyngshede) [07:02:18] (03CR) 10Gmodena: [C: 03+2] "Changes LGTM. Really nice refactoring of the log file format." [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [07:03:02] (03Merged) 10jenkins-bot: Add flink-app default log config and use it in page_content_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [07:05:46] * kart_ going ahead for deployment.. 
[07:06:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918922 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [07:07:18] (03Merged) 10jenkins-bot: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918922 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [07:07:46] (03CR) 10Gmodena: [C: 03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/920379 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [07:07:57] (03CR) 10Santhosh: [C: 04-1] Update MinT to 2023-05-17-052844-production (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 (owner: 10KartikMistry) [07:08:58] Unexpected commit on /srv/mediawiki-staging -- Can anyone look into this? [07:09:04] !log kartik@deploy1002 Backport cancelled. [07:10:07] (03PS1) 10TrainBranchBot: Revert "Enable the new Special:Contribute page entry point for desktop on selected wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920625 [07:10:09] (03CR) 10TrainBranchBot: "kartik@deploy1002 created a revert of this change as Iffbb2cdd5dfd9c6ff4adf2973acbbd0ca365fa89" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918922 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [07:10:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920625 (owner: 10TrainBranchBot) [07:10:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 T336725', diff saved to https://phabricator.wikimedia.org/P48273 and previous config saved to /var/cache/conftool/dbconfig/20230517-071039-root.json [07:10:43] T336725: decommission db1121.eqiad.wmnet - https://phabricator.wikimedia.org/T336725 [07:11:19] (03Merged) 10jenkins-bot: Revert "Enable the new 
Special:Contribute page entry point for desktop on selected wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920625 (owner: 10TrainBranchBot) [07:11:46] !log kartik@deploy1002 Started scap: Backport for [[gerrit:920625|Revert "Enable the new Special:Contribute page entry point for desktop on selected wikis"]] [07:11:49] (03PS1) 10Marostegui: db1121: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/920627 (https://phabricator.wikimedia.org/T336725) [07:11:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/920368 (owner: 10Herron) [07:12:20] (03CR) 10Marostegui: [C: 03+2] db1121: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/920627 (https://phabricator.wikimedia.org/T336725) (owner: 10Marostegui) [07:12:36] (03PS5) 10KartikMistry: Update MinT to 2023-05-17-052844-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 [07:13:17] (03CR) 10Giuseppe Lavagetto: [C: 03+2] prometheus/k8s: add selective scraping of ports in staging [puppet] - 10https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [07:13:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 3%: Repooling after a crash', diff saved to https://phabricator.wikimedia.org/P48274 and previous config saved to /var/cache/conftool/dbconfig/20230517-071322-root.json [07:13:25] !log kartik@deploy1002 trainbranchbot and kartik: Backport for [[gerrit:920625|Revert "Enable the new Special:Contribute page entry point for desktop on selected wikis"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [07:14:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48275 and previous config saved to /var/cache/conftool/dbconfig/20230517-071428-root.json [07:19:08] 
!log kartik@deploy1002 Finished scap: Backport for [[gerrit:920625|Revert "Enable the new Special:Contribute page entry point for desktop on selected wikis"]] (duration: 07m 22s) [07:19:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:20:42] (03PS1) 10Gmodena: Revert "Add flink-app default log config and use it in page_content_change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/920576 [07:21:58] Krinkle: "Profiler: Implement "Excimer UI" option for WikimediaDebug" seems undeployed on /srv/mediawiki-staging - can you take a look at it? [07:22:18] It is blocking config deployment for me. [07:23:44] (03PS1) 10KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) [07:24:27] (03PS2) 10Hashar: gerrit: remove duplicate $gerrit_site definition [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) [07:24:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:24:37] (03PS1) 10Giuseppe Lavagetto: prometheus/k8s: fix annotation name [puppet] - 10https://gerrit.wikimedia.org/r/920628 [07:25:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1122 for decommissioning', diff saved to https://phabricator.wikimedia.org/P48276 and previous config saved to /var/cache/conftool/dbconfig/20230517-072508-root.json [07:25:31] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908604 
(https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [07:25:38] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] prometheus/k8s: fix annotation name [puppet] - 10https://gerrit.wikimedia.org/r/920628 (owner: 10Giuseppe Lavagetto) [07:26:17] (03PS1) 10Marostegui: db1122: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/920629 [07:26:45] (03CR) 10Marostegui: [C: 03+2] db1122: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/920629 (owner: 10Marostegui) [07:27:08] Anyone know if it is possible to avoid undeployed changes on /srv/mediawiki-config and continue config deployment? [07:28:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 4%: Repooling after a crash', diff saved to https://phabricator.wikimedia.org/P48277 and previous config saved to /var/cache/conftool/dbconfig/20230517-072827-root.json [07:29:00] (03CR) 10Ayounsi: dhcp: reword some exception messages (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 (owner: 10Volans) [07:29:10] (03CR) 10Gmodena: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [07:30:32] (03CR) 10Ayounsi: [C: 03+2] users: update SSH key for dwisehaupt [homer/public] - 10https://gerrit.wikimedia.org/r/920351 (https://phabricator.wikimedia.org/T336769) (owner: 10Dwisehaupt) [07:31:01] (03CR) 10Ayounsi: [C: 03+2] users: change my own SSH key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920301 (https://phabricator.wikimedia.org/T336769) (owner: 10Jgreen) [07:31:10] (03Merged) 10jenkins-bot: users: update SSH key for dwisehaupt [homer/public] - 10https://gerrit.wikimedia.org/r/920351 (https://phabricator.wikimedia.org/T336769) (owner: 10Dwisehaupt) [07:31:16] (03CR) 10Ayounsi: [C: 03+2] users: Update my SSH key to a ed25519 one [homer/public] - 10https://gerrit.wikimedia.org/r/920295 (https://phabricator.wikimedia.org/T336769) (owner: 10JMeybohm) [07:31:36] (03Merged) 10jenkins-bot: users: change my own SSH key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920301 (https://phabricator.wikimedia.org/T336769) (owner: 10Jgreen) [07:31:49] (03Merged) 10jenkins-bot: users: Update my SSH key to a ed25519 one [homer/public] - 10https://gerrit.wikimedia.org/r/920295 (https://phabricator.wikimedia.org/T336769) (owner: 10JMeybohm) [07:34:43] (03PS3) 10Giuseppe Lavagetto: shellbox: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919155 (https://phabricator.wikimedia.org/T271822) [07:35:50] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'clear' for AS: 37468 [07:36:28] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Degraded RAID on analytics1068 - https://phabricator.wikimedia.org/T336826 (10Peachey88) [07:36:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'clear' for AS: 37468 [07:37:43] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Degraded RAID on analytics1068 - 
https://phabricator.wikimedia.org/T336826 (10Peachey88) [07:37:45] 10SRE, 10ops-eqiad: Degraded RAID on analytics1068 - https://phabricator.wikimedia.org/T336814 (10Peachey88) [07:38:35] (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/908604/1832/" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [07:39:51] hashar: Do you know what we can do about undeployed changes on /srv/mediawiki-config repository? [07:41:04] kart_: yeah sure [07:41:38] it is a bit tedious, one has to check what the change is about and check the related Gerrit change to see whether there are any indications [07:41:52] next step is to find whether the change actually got deployed which can be done by looking at /srv/mediawiki [07:41:58] and asking the author / CR+2 author [07:42:12] if it never got deployed, I think we should revert it from Gerrit [07:43:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: Repooling after a crash', diff saved to https://phabricator.wikimedia.org/P48278 and previous config saved to /var/cache/conftool/dbconfig/20230517-074332-root.json [07:43:35] are you wondering about Timo's change? [07:43:45] or is that the change you have merged and then reverted? 
[07:43:49] (03PS4) 10Giuseppe Lavagetto: shellbox: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919155 (https://phabricator.wikimedia.org/T271822) [07:45:07] (03PS5) 10Giuseppe Lavagetto: shellbox: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919155 (https://phabricator.wikimedia.org/T271822) [07:45:09] (03CR) 10TChin: Add flink-app default log config and use it in page_content_change (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [07:45:22] (03CR) 10Ayounsi: [C: 03+1] Add EVPN protocol config for enabled L3 switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney) [07:45:37] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on krb1001.eqiad.wmnet with reason: Update to Bullseye [07:45:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb1001.eqiad.wmnet with reason: Update to Bullseye [07:46:01] 10SRE, 10Infrastructure-Foundations: Migrate the KDCs to Bullseye - https://phabricator.wikimedia.org/T331695 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0862bbee-318e-4c78-92cc-9304bf025af3) set by jmm@cumin2002 for 2:00:00 on 1 host(s) and their services with reason: Update to Bullsey... [07:46:57] hashar: Timo's change. [07:47:08] hashar: I had to undeploy my change. 
[07:48:58] !log upgrading krb1001 to Bullseye T331695 [07:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:02] T331695: Migrate the KDCs to Bullseye - https://phabricator.wikimedia.org/T331695 [07:49:38] (03PS1) 10Elukey: role::ml_k8s::staging::worker: fix lvs config [puppet] - 10https://gerrit.wikimedia.org/r/920630 [07:51:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] shellbox: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919155 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [07:51:27] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41209/console" [puppet] - 10https://gerrit.wikimedia.org/r/920630 (owner: 10Elukey) [07:51:43] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_k8s::staging::worker: fix lvs config [puppet] - 10https://gerrit.wikimedia.org/r/920630 (owner: 10Elukey) [07:51:58] hashar: it is available in /srv/mediawiki and by looking at T291015, it isn't deployed yet (Merged today only: https://phabricator.wikimedia.org/T291015#8857635) [07:51:59] T291015: Add per-request flamegraph option to WikimediaDebug - https://phabricator.wikimedia.org/T291015 [07:52:11] (03Merged) 10jenkins-bot: shellbox: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919155 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [07:57:10] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [07:57:20] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [07:58:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Repooling after a crash', diff saved to https://phabricator.wikimedia.org/P48283 and previous config saved to /var/cache/conftool/dbconfig/20230517-075836-root.json [07:59:54] (03Merged) 10jenkins-bot: shellbox: 
bump chart minor version [deployment-charts] - 10https://gerrit.wikimedia.org/r/920631 (owner: 10Giuseppe Lavagetto) [08:00:05] dancy and hashar: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T0800). nyaa~ [08:00:05] (03CR) 10Santhosh: [C: 03+1] Update MinT to 2023-05-17-052844-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 (owner: 10KartikMistry) [08:03:45] (03CR) 10Klausman: [C: 03+1] service::catalog: add initial config for k8s-ingress-ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/920218 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [08:04:09] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [08:05:00] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [08:08:27] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [08:08:53] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [08:13:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ayounsi) https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ was alerting with: > cloudswift1001 (WMF5069) Device i... 
[08:13:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Repooling after a crash', diff saved to https://phabricator.wikimedia.org/P48284 and previous config saved to /var/cache/conftool/dbconfig/20230517-081341-root.json [08:17:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet [08:22:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet [08:24:31] (03CR) 10KartikMistry: Update MinT to 2023-05-17-052844-production (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 (owner: 10KartikMistry) [08:28:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: Repooling after a crash', diff saved to https://phabricator.wikimedia.org/P48285 and previous config saved to /var/cache/conftool/dbconfig/20230517-082846-root.json [08:29:48] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM thanks :)" [homer/public] - 10https://gerrit.wikimedia.org/r/920632 (owner: 10Ayounsi) [08:33:55] (03PS1) 10Giuseppe Lavagetto: shellbox: properly define value to switch to named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/920635 [08:34:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] shellbox: properly define value to switch to named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/920635 (owner: 10Giuseppe Lavagetto) [08:35:03] (03CR) 10Ayounsi: [C: 03+2] Fix missing semi-colon [homer/public] - 10https://gerrit.wikimedia.org/r/920632 (owner: 10Ayounsi) [08:35:05] (03Merged) 10jenkins-bot: shellbox: properly define value to switch to named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/920635 (owner: 10Giuseppe Lavagetto) [08:35:53] (03Merged) 10jenkins-bot: Fix missing semi-colon [homer/public] - 10https://gerrit.wikimedia.org/r/920632 (owner: 10Ayounsi) [08:36:32] (03PS1) 10Giuseppe Lavagetto: prometheus: add the named port scraping option to all wikikube clusters [puppet] - 
10https://gerrit.wikimedia.org/r/920636 [08:36:54] 10SRE, 10Infrastructure-Foundations, 10netops: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) [08:37:16] (03PS2) 10Giuseppe Lavagetto: prometheus: add the named port scraping option to all wikikube clusters [puppet] - 10https://gerrit.wikimedia.org/r/920636 (https://phabricator.wikimedia.org/T271822) [08:43:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Repooling after a crash', diff saved to https://phabricator.wikimedia.org/P48287 and previous config saved to /var/cache/conftool/dbconfig/20230517-084350-root.json [08:45:18] (03PS1) 10Muehlenhoff: Add krb1001 back to KDCs exposed to Kerberos clients and drop krb2001 [puppet] - 10https://gerrit.wikimedia.org/r/920637 (https://phabricator.wikimedia.org/T331695) [08:46:36] (03CR) 10DCausse: [C: 03+1] Create mediawiki-page-content-change-enrichment namespaces in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/920382 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [08:50:13] (03PS1) 10Btullis: users: change my own key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920638 [08:58:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Repooling after a crash', diff saved to https://phabricator.wikimedia.org/P48288 and previous config saved to /var/cache/conftool/dbconfig/20230517-085855-root.json [09:00:43] (03PS3) 10Giuseppe Lavagetto: prometheus: add the named port scraping option to all clusters [puppet] - 10https://gerrit.wikimedia.org/r/920636 (https://phabricator.wikimedia.org/T271822) [09:03:24] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add the named port scraping option to all clusters [puppet] - 10https://gerrit.wikimedia.org/r/920636 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [09:08:12] (03CR) 10Ayounsi: [C: 03+1] dhcp: cleanup the snippet on refresh failure 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/920224 (https://phabricator.wikimedia.org/T336696) (owner: 10Volans) [09:12:25] (03PS1) 10Arturo Borrero Gonzalez: cloud: wmf-auto-restart: exclude NFS filesystems [puppet] - 10https://gerrit.wikimedia.org/r/920644 (https://phabricator.wikimedia.org/T316544) [09:14:04] (03PS1) 10MVernon: swift: remove ms-be204[0-3] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/920645 (https://phabricator.wikimedia.org/T335280) [09:14:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1220 cleaning gtid_domain_id', diff saved to https://phabricator.wikimedia.org/P48289 and previous config saved to /var/cache/conftool/dbconfig/20230517-091407-root.json [09:16:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48290 and previous config saved to /var/cache/conftool/dbconfig/20230517-091606-root.json [09:16:26] (03CR) 10MVernon: "I checked these nodes were drained thus:" [puppet] - 10https://gerrit.wikimedia.org/r/920645 (https://phabricator.wikimedia.org/T335280) (owner: 10MVernon) [09:16:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/920644 (https://phabricator.wikimedia.org/T316544) (owner: 10Arturo Borrero Gonzalez) [09:17:40] (03CR) 10Muehlenhoff: [C: 03+2] Add krb1001 back to KDCs exposed to Kerberos clients and drop krb2001 [puppet] - 10https://gerrit.wikimedia.org/r/920637 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [09:18:44] 10SRE-swift-storage: Bring ms-be207[0-3] into the rings - https://phabricator.wikimedia.org/T335278 (10MatthewVernon) 05Open→03Resolved These nodes are now fully loaded in the rings. 
[09:18:46] 10SRE-swift-storage: Q4 ms backend refresh work (KR) - https://phabricator.wikimedia.org/T335270 (10MatthewVernon) [09:19:43] 10SRE-swift-storage, 10Patch-For-Review: Drain and then decommission ms-be20[40-43] - https://phabricator.wikimedia.org/T335280 (10MatthewVernon) Nodes now fully drained, next step is to remove from rings (then they can be decommissioned). [09:20:22] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Urgent: disk failed in ms-be1063 - https://phabricator.wikimedia.org/T336778 (10MatthewVernon) [09:20:24] 10SRE-swift-storage: Drain and then decommission ms-be10[40-43] - https://phabricator.wikimedia.org/T335281 (10MatthewVernon) [09:20:26] 10SRE-swift-storage: Bring ms-be107[2-5] into the rings - https://phabricator.wikimedia.org/T335279 (10MatthewVernon) [09:20:49] (03CR) 10DCausse: mediawiki-page-content-change-enrichment: enable HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [09:23:38] ACKNOWLEDGEMENT - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,systemd-timesyncd.service,wmf_auto_restart_systemd-timesyncd.service MVernon This node about to be decommissioned (and has 0 weight in the swift rings) - T335280 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:38] ACKNOWLEDGEMENT - Disk space on ms-be2042 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sde1 is not accessible: Input/output error MVernon This node about to be decommissioned (and has 0 weight in the swift rings) - T335280 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2042&var-datasource=codfw+prometheus/ops [09:25:29] (03PS1) 10Samtar: diff: Only show inline legend for text slot [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920578 (https://phabricator.wikimedia.org/T336481) 
[09:25:47] (03CR) 10Elukey: [C: 03+1] "LGTM! Let's wait for Hugh's review since we changed the changeprop's chart, if everything looks good for them too we'll merge and start te" [deployment-charts] - 10https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [09:26:17] (03PS1) 10Samtar: onDifferenceEngineBeforeDiffTable: Return early on Special pages [extensions/VisualEditor] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920579 (https://phabricator.wikimedia.org/T336582) [09:27:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:28:41] (03CR) 10Elukey: [C: 03+2] service::catalog: add initial config for k8s-ingress-ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/920218 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [09:29:00] jouncebot: nowandnext [09:29:00] For the next 0 hour(s) and 30 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T0800) [09:29:00] In 0 hour(s) and 30 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T1000) [09:29:26] (03CR) 10David Caro: cloud: wmf-auto-restart: exclude NFS filesystems (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920644 (https://phabricator.wikimedia.org/T316544) (owner: 10Arturo Borrero Gonzalez) [09:30:56] (03PS2) 10Elukey: service::catalog: switch k8s-ingress-ml-serve to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/920219 (https://phabricator.wikimedia.org/T336726) [09:31:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48291 and 
previous config saved to /var/cache/conftool/dbconfig/20230517-093110-root.json [09:31:13] hello! o/ I have two patches I'd like to backport to `1.41.0-wmf.9` (https://gerrit.wikimedia.org/r/920578 and https://gerrit.wikimedia.org/r/920579) — should I wait until after the train rolls in 30 minutes? [09:32:29] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41210/console" [puppet] - 10https://gerrit.wikimedia.org/r/920219 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [09:32:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:33:16] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/908769 (owner: 10Slyngshede) [09:36:32] (03CR) 10Elukey: [V: 03+1 C: 03+2] service::catalog: switch k8s-ingress-ml-serve to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/920219 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [09:39:07] !log roll restart pybal on lvs2010, lvs2009, lvs1020, lvs1019 to pick up a VIP (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/920219) - T336726 [09:39:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: Maintenance [09:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:12] T336726: Create k8s ingress config and VIP for ores-legacy - https://phabricator.wikimedia.org/T336726 [09:39:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: Maintenance [09:39:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2029 (T335845)', diff saved to 
https://phabricator.wikimedia.org/P48292 and previous config saved to /var/cache/conftool/dbconfig/20230517-093928-ladsgroup.json [09:39:48] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-ml-serve_31443: Servers kubernetes2011.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled https://wi [09:39:48] ikimedia.org/wiki/PyBal [09:41:08] (03PS7) 10Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) [09:41:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] prometheus: add the named port scraping option to all clusters [puppet] - 10https://gerrit.wikimedia.org/r/920636 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [09:42:28] (03CR) 10Samtar: "recheck" [extensions/VisualEditor] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920579 (https://phabricator.wikimedia.org/T336582) (owner: 10Samtar) [09:43:08] (03PS2) 10Arturo Borrero Gonzalez: cloud: wmf-auto-restart: exclude NFS filesystems [puppet] - 10https://gerrit.wikimedia.org/r/920644 (https://phabricator.wikimedia.org/T316544) [09:43:10] (03PS1) 10Arturo Borrero Gonzalez: profile::auto_restarts: allow the systemd timer to not be installed [puppet] - 10https://gerrit.wikimedia.org/r/920648 (https://phabricator.wikimedia.org/T316544) [09:43:31] (03PS1) 10Elukey: service::catalog: fix lvs config for k8s-ingress-ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/920649 [09:44:50] (03CR) 10CI reject: [V: 04-1] onDifferenceEngineBeforeDiffTable: Return early on Special pages [extensions/VisualEditor] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920579 
(https://phabricator.wikimedia.org/T336582) (owner: 10Samtar) [09:44:59] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41211/console" [puppet] - 10https://gerrit.wikimedia.org/r/920649 (owner: 10Elukey) [09:45:17] (03PS2) 10Arturo Borrero Gonzalez: profile::auto_restarts: allow the systemd timer to not be installed [puppet] - 10https://gerrit.wikimedia.org/r/920648 (https://phabricator.wikimedia.org/T316544) [09:46:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48293 and previous config saved to /var/cache/conftool/dbconfig/20230517-094615-root.json [09:48:04] (03CR) 10Elukey: [V: 03+1 C: 03+2] service::catalog: fix lvs config for k8s-ingress-ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/920649 (owner: 10Elukey) [09:50:46] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [09:50:58] (03CR) 10Arturo Borrero Gonzalez: cloud: wmf-auto-restart: exclude NFS filesystems (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920644 (https://phabricator.wikimedia.org/T316544) (owner: 10Arturo Borrero Gonzalez) [09:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:33] (03CR) 10Arturo Borrero Gonzalez: "may not be required per https://gerrit.wikimedia.org/r/920648" [puppet] - 10https://gerrit.wikimedia.org/r/920644 (https://phabricator.wikimedia.org/T316544) (owner: 10Arturo Borrero Gonzalez) [09:52:48] (03PS1) 10Marostegui: control-mariadb-client-10.4-bullseye: Update to 10.4.29 [software] - 
10https://gerrit.wikimedia.org/r/920651 (https://phabricator.wikimedia.org/T336462) [09:52:54] (03CR) 10CI reject: [V: 04-1] control-mariadb-client-10.4-bullseye: Update to 10.4.29 [software] - 10https://gerrit.wikimedia.org/r/920651 (https://phabricator.wikimedia.org/T336462) (owner: 10Marostegui) [09:53:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P48294 and previous config saved to /var/cache/conftool/dbconfig/20230517-095301-ladsgroup.json [09:54:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2026.codfw.wmnet with reason: Maintenance [09:54:35] (03PS1) 10Marostegui: control-mariadb-client-10.4-bullseye: Update to 10.4.29 [software] - 10https://gerrit.wikimedia.org/r/920652 (https://phabricator.wikimedia.org/T336462) [09:54:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2026.codfw.wmnet with reason: Maintenance [09:54:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2026 (T335845)', diff saved to https://phabricator.wikimedia.org/P48295 and previous config saved to /var/cache/conftool/dbconfig/20230517-095443-ladsgroup.json [09:54:51] (03Abandoned) 10Marostegui: control-mariadb-client-10.4-bullseye: Update to 10.4.29 [software] - 10https://gerrit.wikimedia.org/r/920651 (https://phabricator.wikimedia.org/T336462) (owner: 10Marostegui) [09:54:57] (03CR) 10CI reject: [V: 04-1] control-mariadb-client-10.4-bullseye: Update to 10.4.29 [software] - 10https://gerrit.wikimedia.org/r/920652 (https://phabricator.wikimedia.org/T336462) (owner: 10Marostegui) [09:55:35] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/920652 (https://phabricator.wikimedia.org/T336462) (owner: 10Marostegui) [09:56:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10Manuel) [09:56:28] 
(03CR) 10Samtar: "recheck" [extensions/VisualEditor] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920579 (https://phabricator.wikimedia.org/T336582) (owner: 10Samtar) [09:57:34] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.4-bullseye: Update to 10.4.29 [software] - 10https://gerrit.wikimedia.org/r/920652 (https://phabricator.wikimedia.org/T336462) (owner: 10Marostegui) [09:58:46] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [09:59:30] (03PS1) 10Marostegui: control-mariadb-client-10.4: Remove file [software] - 10https://gerrit.wikimedia.org/r/920653 [09:59:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2026 (T335845)', diff saved to https://phabricator.wikimedia.org/P48296 and previous config saved to /var/cache/conftool/dbconfig/20230517-095936-ladsgroup.json [09:59:58] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T1000) [10:00:36] (03CR) 10CI reject: [V: 04-1] control-mariadb-client-10.4: Remove file [software] - 10https://gerrit.wikimedia.org/r/920653 (owner: 10Marostegui) [10:01:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48297 and previous config saved to /var/cache/conftool/dbconfig/20230517-100120-root.json [10:01:51] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/920653 (owner: 10Marostegui) [10:03:06] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:05] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/920653 (owner: 10Marostegui) [10:07:32] (03PS1) 10Arturo Borrero Gonzalez: profile::auto_restarts::service: add some spec tests [puppet] - 10https://gerrit.wikimedia.org/r/920654 [10:08:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P48298 and previous config saved to /var/cache/conftool/dbconfig/20230517-100805-ladsgroup.json [10:08:10] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [10:08:19] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [10:08:33] <_joe_> jouncebot: now [10:08:33] For the next 0 hour(s) and 51 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T1000) [10:08:39] <_joe_> oh perfect then [10:08:50] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [10:09:21] (03PS2) 10Elukey: service::catalog: switch k8s-ingress-ml-serve to production [puppet] - 10https://gerrit.wikimedia.org/r/920220 (https://phabricator.wikimedia.org/T336726) [10:09:32] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [10:10:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10Manuel) [10:10:13] (03CR) 10CI reject: [V: 04-1] profile::auto_restarts::service: add some spec tests [puppet] - 10https://gerrit.wikimedia.org/r/920654 (owner: 10Arturo Borrero Gonzalez) [10:10:23] (03CR) 10Hnowlan: [C: 03+1] "Looks okay to me from a changeprop standpoint!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [10:11:39] (03CR) 10Elukey: [C: 03+2] service::catalog: switch k8s-ingress-ml-serve to production [puppet] - 10https://gerrit.wikimedia.org/r/920220 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [10:13:49] (03PS1) 10Ayounsi: interface validator: workaround bug with count_ipaddresses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/920656 (https://phabricator.wikimedia.org/T310590) [10:14:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2026', diff saved to https://phabricator.wikimedia.org/P48299 and previous config saved to /var/cache/conftool/dbconfig/20230517-101442-ladsgroup.json [10:14:47] (03PS1) 10Jcrespo: control-mariadb-client-10.4: Remove file [software] - 10https://gerrit.wikimedia.org/r/920657 [10:16:02] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [10:16:24] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [10:16:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48300 and previous config saved to /var/cache/conftool/dbconfig/20230517-101624-root.json [10:17:08] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [10:17:16] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [10:17:38] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Odd issue I think best to work-around like this without getting too much into the weeds." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/920656 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [10:17:42] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.4: Remove file [software] - 10https://gerrit.wikimedia.org/r/920657 (owner: 10Jcrespo) [10:18:23] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [10:19:03] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [10:19:22] (03CR) 10Oleksandr Tsyba (WMDE): [C: 03+1] Enable wmgWikibaseTmpWbsubscribersSensibleOutput on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920296 (https://phabricator.wikimedia.org/T336760) (owner: 10Guergana Tzatchkova) [10:19:30] (03CR) 10David Caro: "have not finished, some comments though" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [10:21:48] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:09] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10Lea_WMDE) As @Manuel 's manager, I support this request :) [10:23:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P48301 and previous config saved to /var/cache/conftool/dbconfig/20230517-102310-ladsgroup.json [10:23:20] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:24:30] TheresNoTime, urbanecm: are you running the backport window later? I have two patches that need to go in together... I haven't done this with `scap backport`, can I just provide two IDs and it will just work? [10:24:55] duesen: yes, `scap backport 123 456` will do what you would expect. [10:24:57] I need to test them on the debug box together, I already know that the first one alone is broken... [10:25:08] urbanecm: excellent, thank you! [10:25:47] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [10:25:58] (03PS1) 10Filippo Giunchedi: cadvisor: add explicity metrics enable [puppet] - 10https://gerrit.wikimedia.org/r/920660 (https://phabricator.wikimedia.org/T108027) [10:26:00] (03PS1) 10Filippo Giunchedi: cadvisor: disable percpu and cpuLoad metrics [puppet] - 10https://gerrit.wikimedia.org/r/920661 (https://phabricator.wikimedia.org/T108027) [10:26:08] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [10:26:19] (03CR) 10CI reject: [V: 04-1] cadvisor: add explicity metrics enable [puppet] - 10https://gerrit.wikimedia.org/r/920660 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [10:26:27] (03CR) 10CI reject: [V: 04-1] cadvisor: disable percpu and cpuLoad metrics [puppet] - 10https://gerrit.wikimedia.org/r/920661 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [10:27:58] :( [10:29:17] (03PS2) 10Filippo Giunchedi: cadvisor: add explicity metrics enable [puppet] - 10https://gerrit.wikimedia.org/r/920660 (https://phabricator.wikimedia.org/T108027) [10:29:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:19] 
(03PS2) 10Filippo Giunchedi: cadvisor: disable percpu and cpuLoad metrics [puppet] - 10https://gerrit.wikimedia.org/r/920661 (https://phabricator.wikimedia.org/T108027) [10:29:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2026', diff saved to https://phabricator.wikimedia.org/P48302 and previous config saved to /var/cache/conftool/dbconfig/20230517-102948-ladsgroup.json [10:31:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1220 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48303 and previous config saved to /var/cache/conftool/dbconfig/20230517-103129-root.json [10:33:29] (03CR) 10Slyngshede: [C: 03+2] Sphinx: Start work on documentation [software/bitu] - 10https://gerrit.wikimedia.org/r/908769 (owner: 10Slyngshede) [10:33:32] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Sphinx: Start work on documentation [software/bitu] - 10https://gerrit.wikimedia.org/r/908769 (owner: 10Slyngshede) [10:34:43] (03PS2) 10Elukey: Add discovery configuration for k8s-ingress-ml-serve [dns] - 10https://gerrit.wikimedia.org/r/920221 (https://phabricator.wikimedia.org/T336726) [10:35:34] (03CR) 10CI reject: [V: 04-1] Add discovery configuration for k8s-ingress-ml-serve [dns] - 10https://gerrit.wikimedia.org/r/920221 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [10:38:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P48304 and previous config saved to /var/cache/conftool/dbconfig/20230517-103815-ladsgroup.json [10:44:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2026 (T335845)', diff saved to https://phabricator.wikimedia.org/P48305 and previous config saved to /var/cache/conftool/dbconfig/20230517-104454-ladsgroup.json [10:45:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: 
Maintenance [10:45:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: Maintenance [10:45:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2033 (T335845)', diff saved to https://phabricator.wikimedia.org/P48306 and previous config saved to /var/cache/conftool/dbconfig/20230517-104519-ladsgroup.json [10:46:07] (03PS1) 10Hnowlan: service: move rest-gateway to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/920664 (https://phabricator.wikimedia.org/T329049) [10:49:22] (03CR) 10Gmodena: mediawiki-page-content-change-enrichment: enable HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [10:49:59] (03PS3) 10Slyngshede: mgmt module [software/bitu] - 10https://gerrit.wikimedia.org/r/918245 [10:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:50:01] (03PS1) 10Slyngshede: Offboarding: Allow managers to offboard users. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/920665 (https://phabricator.wikimedia.org/T335476) [10:50:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2033 (T335845)', diff saved to https://phabricator.wikimedia.org/P48307 and previous config saved to /var/cache/conftool/dbconfig/20230517-105012-ladsgroup.json [10:52:44] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:57:18] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [10:57:29] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [10:57:41] <_joe_> jouncebot: next [10:57:41] In 2 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T1300) [10:57:51] <_joe_> oh ok so I have time to finish netbox [10:58:11] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [10:58:31] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [10:59:01] (03PS1) 10Hnowlan: service: move rest-gateway to production [puppet] - 10https://gerrit.wikimedia.org/r/920667 (https://phabricator.wikimedia.org/T329049) [10:59:26] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [10:59:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1028.eqiad.wmnet with reason: Maintenance [10:59:46] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [10:59:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1028.eqiad.wmnet with reason: Maintenance [10:59:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1028 (T335845)', diff saved to 
https://phabricator.wikimedia.org/P48308 and previous config saved to /var/cache/conftool/dbconfig/20230517-105957-ladsgroup.json [11:00:20] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:31] (03PS1) 10Jaime Nuche: WIP [puppet] - 10https://gerrit.wikimedia.org/r/920669 [11:00:48] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [11:00:58] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [11:01:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1026.eqiad.wmnet with reason: Maintenance [11:01:16] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [11:01:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1026.eqiad.wmnet with reason: Maintenance [11:01:25] (03PS1) 10Slyngshede: Login: Allow landing page to be configured. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/920670 [11:01:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1026 (T335845)', diff saved to https://phabricator.wikimedia.org/P48309 and previous config saved to /var/cache/conftool/dbconfig/20230517-110130-ladsgroup.json [11:01:34] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [11:01:52] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [11:01:56] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:58] (03CR) 10JMeybohm: [C: 03+1] Create mediawiki-page-content-change-enrichment namespaces in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/920382 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [11:02:09] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [11:02:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: Maintenance [11:02:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: Maintenance [11:02:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2034 (T335845)', diff saved to https://phabricator.wikimedia.org/P48310 and previous config saved to /var/cache/conftool/dbconfig/20230517-110251-ladsgroup.json [11:03:19] * kart_ is updating MinT [11:04:01] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [11:04:11] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [11:05:02] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:14] !log 
oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [11:05:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2033', diff saved to https://phabricator.wikimedia.org/P48311 and previous config saved to /var/cache/conftool/dbconfig/20230517-110518-ladsgroup.json [11:05:31] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [11:06:32] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:44] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [11:07:03] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [11:07:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2034 (T335845)', diff saved to https://phabricator.wikimedia.org/P48312 and previous config saved to /var/cache/conftool/dbconfig/20230517-110745-ladsgroup.json [11:07:52] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-05-17-052844-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 (owner: 10KartikMistry) [11:08:24] (03PS1) 10Bartosz Dziewoński: Define $maintClass in maintenance script for compatibility [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920582 (https://phabricator.wikimedia.org/T317375) [11:09:07] (03Merged) 10jenkins-bot: Update MinT to 2023-05-17-052844-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920565 (owner: 10KartikMistry) [11:09:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:09:56] (03PS2) 10Slyngshede: Login: Allow landing page to be configured. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/920670 [11:10:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1026 (T335845)', diff saved to https://phabricator.wikimedia.org/P48313 and previous config saved to /var/cache/conftool/dbconfig/20230517-111020-ladsgroup.json [11:10:37] (03PS2) 10JMeybohm: envoyproxy: Add python 3.11 to tox [puppet] - 10https://gerrit.wikimedia.org/r/916499 (https://phabricator.wikimedia.org/T300324) [11:10:51] (03CR) 10JMeybohm: [C: 03+2] envoy: Move upstream HTTP config into the new HttpProtocolOptions message [puppet] - 10https://gerrit.wikimedia.org/r/916498 (https://phabricator.wikimedia.org/T303230) (owner: 10JMeybohm) [11:10:55] (03PS3) 10Slyngshede: Login: Allow landing page to be configured. [software/bitu] - 10https://gerrit.wikimedia.org/r/920670 [11:11:55] (03PS1) 10Muehlenhoff: Remove LDAP access for jrabah [puppet] - 10https://gerrit.wikimedia.org/r/920672 [11:12:07] (03PS4) 10Slyngshede: Login: Allow landing page to be configured. [software/bitu] - 10https://gerrit.wikimedia.org/r/920670 [11:13:27] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [11:13:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for jrabah [puppet] - 10https://gerrit.wikimedia.org/r/920672 (owner: 10Muehlenhoff) [11:13:40] (03CR) 10Slyngshede: "Quick patch for making the signup and login process flow better, until all features are in place." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/920670 (owner: 10Slyngshede) [11:13:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028 (T335845)', diff saved to https://phabricator.wikimedia.org/P48314 and previous config saved to /var/cache/conftool/dbconfig/20230517-111350-ladsgroup.json [11:14:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:15:18] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [11:15:40] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:17] (03PS1) 10Slyngshede: wikimedia: Account linking text. [software/bitu] - 10https://gerrit.wikimedia.org/r/920674 [11:20:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2033', diff saved to https://phabricator.wikimedia.org/P48315 and previous config saved to /var/cache/conftool/dbconfig/20230517-112024-ladsgroup.json [11:22:01] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [11:22:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2034', diff saved to https://phabricator.wikimedia.org/P48316 and previous config saved to /var/cache/conftool/dbconfig/20230517-112251-ladsgroup.json [11:25:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1026', diff saved to https://phabricator.wikimedia.org/P48317 and previous config saved to /var/cache/conftool/dbconfig/20230517-112526-ladsgroup.json [11:26:30] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [11:28:32] !log kartik@deploy1002 helmfile 
[eqiad] START helmfile.d/services/machinetranslation: apply [11:28:47] (03PS1) 10Slyngshede: C:IDM Change landing page to be LDAP properties page. [puppet] - 10https://gerrit.wikimedia.org/r/920675 [11:28:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028', diff saved to https://phabricator.wikimedia.org/P48318 and previous config saved to /var/cache/conftool/dbconfig/20230517-112856-ladsgroup.json [11:29:25] (03CR) 10Muehlenhoff: "wmf-auto-restart is unrelated to debdeploy, where is this causing an issue? Happy to have a closer look." [puppet] - 10https://gerrit.wikimedia.org/r/920648 (https://phabricator.wikimedia.org/T316544) (owner: 10Arturo Borrero Gonzalez) [11:29:38] (03CR) 10Slyngshede: [C: 03+2] C:IDM Change landing page to be LDAP properties page. [puppet] - 10https://gerrit.wikimedia.org/r/920675 (owner: 10Slyngshede) [11:31:31] (03PS1) 10Slyngshede: C:IDM Change landing page to be LDAP properties page. [puppet] - 10https://gerrit.wikimedia.org/r/920676 [11:31:55] (03CR) 10CI reject: [V: 04-1] C:IDM Change landing page to be LDAP properties page. [puppet] - 10https://gerrit.wikimedia.org/r/920676 (owner: 10Slyngshede) [11:32:05] (03PS5) 10Gmodena: mediawiki-page-content-change-enrichment: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) [11:32:13] (03CR) 10CI reject: [V: 04-1] mediawiki-page-content-change-enrichment: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [11:33:15] (03Abandoned) 10Slyngshede: C:IDM Change landing page to be LDAP properties page. [puppet] - 10https://gerrit.wikimedia.org/r/920676 (owner: 10Slyngshede) [11:33:20] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [11:34:34] (03PS2) 10Slyngshede: C:IDM Change landing page to be LDAP properties page. 
[puppet] - 10https://gerrit.wikimedia.org/r/920675 [11:34:58] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. It's not a big problem to not have it on L2 devices but the consistency across the estate is a win good stuff." [homer/public] - 10https://gerrit.wikimedia.org/r/920311 (https://phabricator.wikimedia.org/T320244) (owner: 10Ayounsi) [11:35:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2033 (T335845)', diff saved to https://phabricator.wikimedia.org/P48319 and previous config saved to /var/cache/conftool/dbconfig/20230517-113531-ladsgroup.json [11:35:35] (03PS5) 10Cathal Mooney: Automate and update DHCP relay configuration [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) [11:36:10] (03CR) 10Slyngshede: "Not sure if you want to review this, it's pretty minor, but goes with https://gerrit.wikimedia.org/r/c/operations/software/bitu/+/920670" [puppet] - 10https://gerrit.wikimedia.org/r/920675 (owner: 10Slyngshede) [11:37:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/920670 (owner: 10Slyngshede) [11:37:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2034', diff saved to https://phabricator.wikimedia.org/P48320 and previous config saved to /var/cache/conftool/dbconfig/20230517-113757-ladsgroup.json [11:38:23] (03CR) 10Slyngshede: [C: 03+2] Login: Allow landing page to be configured. [software/bitu] - 10https://gerrit.wikimedia.org/r/920670 (owner: 10Slyngshede) [11:38:31] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Login: Allow landing page to be configured. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/920670 (owner: 10Slyngshede) [11:38:41] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (10hnowlan) [11:38:51] !log Update MinT to 2023-05-17-052844-production: Set CT2_USE_EXPERIMENTAL_PACKED_GEMM for better performance [11:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/920675 (owner: 10Slyngshede) [11:39:21] (03CR) 10Slyngshede: [C: 03+2] C:IDM Change landing page to be LDAP properties page. [puppet] - 10https://gerrit.wikimedia.org/r/920675 (owner: 10Slyngshede) [11:39:25] (03CR) 10Ayounsi: [C: 03+1] Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [11:39:37] (03CR) 10Slyngshede: [C: 03+2] Django 3.2 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/905158 (owner: 10Slyngshede) [11:39:51] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10aborrero) [11:40:01] (03PS6) 10Klausman: helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) [11:40:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1026', diff saved to https://phabricator.wikimedia.org/P48321 and previous config saved to /var/cache/conftool/dbconfig/20230517-114032-ladsgroup.json [11:40:43] (03CR) 10Klausman: [C: 03+1] changeprop: add liftwing outlink topic stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/920282 
(https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [11:40:52] (03PS3) 10Arturo Borrero Gonzalez: cloud: wmf-auto-restart: exclude NFS filesystems [puppet] - 10https://gerrit.wikimedia.org/r/920644 (https://phabricator.wikimedia.org/T316544) [11:40:54] (03PS3) 10Arturo Borrero Gonzalez: profile::auto_restarts: allow the systemd timer to not be installed [puppet] - 10https://gerrit.wikimedia.org/r/920648 (https://phabricator.wikimedia.org/T316544) [11:40:56] (03PS2) 10Arturo Borrero Gonzalez: profile::auto_restarts::service: add some spec tests [puppet] - 10https://gerrit.wikimedia.org/r/920654 (https://phabricator.wikimedia.org/T336845) [11:41:45] (03CR) 10Klausman: helmfile.d: add revertrisk model config to ml-serve clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [11:42:02] (03PS7) 10Klausman: helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) [11:43:20] (03CR) 10Nikerabbit: [C: 04-1] Enable the new Special:Contribute page entry point for desktop on selected wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [11:44:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028', diff saved to https://phabricator.wikimedia.org/P48322 and previous config saved to /var/cache/conftool/dbconfig/20230517-114402-ladsgroup.json [11:44:17] (03CR) 10CI reject: [V: 04-1] profile::auto_restarts::service: add some spec tests [puppet] - 10https://gerrit.wikimedia.org/r/920654 (https://phabricator.wikimedia.org/T336845) (owner: 10Arturo Borrero Gonzalez) [11:46:04] (03CR) 10Ayounsi: Automate and update DHCP relay configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/908346 
(https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [11:52:56] (03CR) 10Btullis: [V: 03+1] ceph: Add puppet management of OSDs on new ceph cluster (0317 comments) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [11:53:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2034 (T335845)', diff saved to https://phabricator.wikimedia.org/P48323 and previous config saved to /var/cache/conftool/dbconfig/20230517-115303-ladsgroup.json [11:55:04] (03CR) 10Muehlenhoff: wikimedia: Account linking text. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/920674 (owner: 10Slyngshede) [11:55:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1026 (T335845)', diff saved to https://phabricator.wikimedia.org/P48324 and previous config saved to /var/cache/conftool/dbconfig/20230517-115538-ladsgroup.json [11:55:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1033.eqiad.wmnet with reason: Maintenance [11:56:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1033.eqiad.wmnet with reason: Maintenance [11:56:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1033 (T335845)', diff saved to https://phabricator.wikimedia.org/P48325 and previous config saved to /var/cache/conftool/dbconfig/20230517-115612-ladsgroup.json [11:58:19] 10SRE, 10SRE-Access-Requests, 10Search-Console-access-request: Please grant scherukuwada@ access to wikisource.org in the Search Console - https://phabricator.wikimedia.org/T336500 (10SCherukuwada) I've responded on the ticket with the volunteer. I'll handle it once they get the NDA and C-level approval out... 
[11:59:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028 (T335845)', diff saved to https://phabricator.wikimedia.org/P48326 and previous config saved to /var/cache/conftool/dbconfig/20230517-115908-ladsgroup.json [11:59:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1034.eqiad.wmnet with reason: Maintenance [11:59:24] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:28] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1034.eqiad.wmnet with reason: Maintenance [11:59:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1034 (T335845)', diff saved to https://phabricator.wikimedia.org/P48327 and previous config saved to /var/cache/conftool/dbconfig/20230517-115943-ladsgroup.json [12:00:56] (03PS2) 10Slyngshede: wikimedia: Account linking text. [software/bitu] - 10https://gerrit.wikimedia.org/r/920674 [12:01:36] (03CR) 10Slyngshede: wikimedia: Account linking text. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/920674 (owner: 10Slyngshede) [12:02:40] (03PS1) 10Muehlenhoff: Remove LDAP access for stef25 [puppet] - 10https://gerrit.wikimedia.org/r/920681 [12:03:35] (03CR) 10CI reject: [V: 04-1] Remove LDAP access for stef25 [puppet] - 10https://gerrit.wikimedia.org/r/920681 (owner: 10Muehlenhoff) [12:03:43] (03CR) 10Muehlenhoff: "One final typo, then we're good to go." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/920674 (owner: 10Slyngshede) [12:04:04] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:16] (03PS3) 10Slyngshede: wikimedia: Account linking text. [software/bitu] - 10https://gerrit.wikimedia.org/r/920674 [12:04:32] (03PS2) 10Muehlenhoff: Remove LDAP access for stef25 [puppet] - 10https://gerrit.wikimedia.org/r/920681 [12:05:04] (03CR) 10Slyngshede: wikimedia: Account linking text. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/920674 (owner: 10Slyngshede) [12:05:32] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:50] (03CR) 10Slyngshede: [C: 03+2] wikimedia: Account linking text. [software/bitu] - 10https://gerrit.wikimedia.org/r/920674 (owner: 10Slyngshede) [12:05:59] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] wikimedia: Account linking text. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/920674 (owner: 10Slyngshede) [12:06:02] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for stef25 [puppet] - 10https://gerrit.wikimedia.org/r/920681 (owner: 10Muehlenhoff) [12:06:53] !log Merging CR822439 and beginning bulk puppetdb -> netbox import to update host interfaces [12:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:09] (03CR) 10Cathal Mooney: [C: 03+2] Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [12:07:44] (03Merged) 10jenkins-bot: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [12:10:00] (03PS2) 10KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) [12:10:24] (03CR) 10KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [12:11:56] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox [12:12:02] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [12:12:28] (03CR) 10Filippo Giunchedi: "This is essentially a no-op (other than restarting cadvisor)" [puppet] - 10https://gerrit.wikimedia.org/r/920660 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [12:13:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1033 (T335845)', diff saved to 
https://phabricator.wikimedia.org/P48328 and previous config saved to /var/cache/conftool/dbconfig/20230517-121306-ladsgroup.json [12:14:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1034 (T335845)', diff saved to https://phabricator.wikimedia.org/P48329 and previous config saved to /var/cache/conftool/dbconfig/20230517-121434-ladsgroup.json [12:15:54] (03CR) 10David Caro: ceph: Add puppet management of OSDs on new ceph cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [12:26:27] (03CR) 10JMeybohm: [C: 03+2] envoyproxy: Add python 3.11 to tox [puppet] - 10https://gerrit.wikimedia.org/r/916499 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:26:36] (03CR) 10Nikerabbit: [C: 03+1] Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [12:28:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1033', diff saved to https://phabricator.wikimedia.org/P48330 and previous config saved to /var/cache/conftool/dbconfig/20230517-122812-ladsgroup.json [12:29:04] (03CR) 10Ottomata: [C: 03+1] Revert "Add flink-app default log config and use it in page_content_change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/920576 (owner: 10Gmodena) [12:29:26] (03PS6) 10Cathal Mooney: Automate and update DHCP relay configuration [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) [12:29:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1034', diff saved to https://phabricator.wikimedia.org/P48331 and previous config saved to /var/cache/conftool/dbconfig/20230517-122940-ladsgroup.json [12:30:02] (03PS7) 10Cathal Mooney: Automate and update DHCP relay configuration 
[homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) [12:31:47] (03CR) 10Cathal Mooney: Automate and update DHCP relay configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [12:34:07] (03CR) 10Daniel Kinzler: [C: 03+2] "preparing backport deployment" [extensions/Math] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920230 (https://phabricator.wikimedia.org/T335347) (owner: 10Daniel Kinzler) [12:34:17] (03CR) 10Daniel Kinzler: [C: 03+2] "preparing backport deployment" [extensions/Math] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920231 (https://phabricator.wikimedia.org/T335347) (owner: 10Daniel Kinzler) [12:34:25] (03CR) 10Arturo Borrero Gonzalez: profile::auto_restarts: allow the systemd timer to not be installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920648 (https://phabricator.wikimedia.org/T316544) (owner: 10Arturo Borrero Gonzalez) [12:39:49] (03CR) 10Ottomata: page_content_change - Consume from mediawiki.page_change.v1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920379 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [12:42:03] (03PS1) 10KartikMistry: MinT: Set CT2_INTRA_THREADS to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920687 [12:43:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1033', diff saved to https://phabricator.wikimedia.org/P48332 and previous config saved to /var/cache/conftool/dbconfig/20230517-124318-ladsgroup.json [12:43:40] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:44:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1034', diff saved to 
https://phabricator.wikimedia.org/P48333 and previous config saved to /var/cache/conftool/dbconfig/20230517-124446-ladsgroup.json [12:45:33] (03CR) 10Ottomata: "Ben, I'm adding you because you have been setting up a new airflow instance for product-analytics, and IIRC this was an annoying manual st" [puppet] - 10https://gerrit.wikimedia.org/r/779897 (owner: 10Ottomata) [12:46:02] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:49] (03CR) 10Gmodena: [C: 03+1] page_content_change - Consume from mediawiki.page_change.v1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920379 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [12:49:10] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1003.eqiad.wmnet [12:50:16] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:51:12] (03PS2) 10Arturo Borrero Gonzalez: wikimedia.cloud: add cloudservices200[4/5]-dev cloud-private address [dns] - 10https://gerrit.wikimedia.org/r/919363 (https://phabricator.wikimedia.org/T307357) [12:52:35] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records following puppetdb bulk import - cmooney@cumin1001" [12:52:45] (03PS5) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/919352 (https://phabricator.wikimedia.org/T324992) [12:53:38] (03Merged) 10jenkins-bot: Revert "Revert "Add getMultiHttpClient function to make HTTP requests to Mathoid."" [extensions/Math] (wmf/1.41.0-wmf.9) - 
10https://gerrit.wikimedia.org/r/920230 (https://phabricator.wikimedia.org/T335347) (owner: 10Daniel Kinzler) [12:53:47] (03Merged) 10jenkins-bot: Use MultiHttpClient instead of VirtualRESTService. [extensions/Math] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920231 (https://phabricator.wikimedia.org/T335347) (owner: 10Daniel Kinzler) [12:54:02] PROBLEM - MariaDB Replica Lag: m2 on db1217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 905.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:10] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records following puppetdb bulk import - cmooney@cumin1001" [12:54:10] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:54:10] PROBLEM - MariaDB Replica Lag: m2 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 912.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:35] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) [12:54:45] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [12:55:05] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10aborrero) 05Open→03Resolved thanks! 
[12:55:26] PROBLEM - MariaDB Replica Lag: m2 on db2133 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 989.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:56:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1003.eqiad.wmnet [12:56:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1004.eqiad.wmnet [12:56:18] (03CR) 10Ottomata: page_content_change - Consume from mediawiki.page_change.v1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920379 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [12:56:58] RECOVERY - MariaDB Replica Lag: m2 on db2133 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:58:01] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) >>! In T307357#8847884, @aborrero wrote: > > * hook the cloudservices boxes to the cloud-private... 
[12:58:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1033 (T335845)', diff saved to https://phabricator.wikimedia.org/P48334 and previous config saved to /var/cache/conftool/dbconfig/20230517-125824-ladsgroup.json [12:58:25] (03PS8) 10Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) [12:58:30] (03PS1) 10Btullis: Update the version of airflow on an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/920689 (https://phabricator.wikimedia.org/T336286) [12:58:40] RECOVERY - MariaDB Replica Lag: m2 on db1217 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:58:48] (03CR) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: enable cloud-private subnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919352 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:59:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1034 (T335845)', diff saved to https://phabricator.wikimedia.org/P48335 and previous config saved to /var/cache/conftool/dbconfig/20230517-125952-ladsgroup.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T1300). [13:00:04] duesen, guerganaWMDE, koi, kart_, hauskater, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:15] wow, a lot of patches! 
[13:00:16] o/
[13:00:18] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[13:00:18] wow that's a lot
[13:00:20] im here
[13:00:26] * kart_ is here
[13:00:26] I'm only available for the first ~45 minutes, ftr.
[13:00:28] o/
[13:00:29] surely we're not going to get through all of those
[13:00:32] There was 5 when I scheduled mine, and now there's 7
[13:00:37] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling reboot on A:aqs-canary
[13:00:45] hi
[13:01:08] i was the last to add items, it's okay if we can't do those
[13:01:33] duesen: ping
[13:01:39] taavi: o/
[13:01:41] seems like you already +2'd your backports and they've been merged
[13:01:46] one of mine was deleted, i had to readd it
[13:01:57] so please finish deploying those
[13:02:07] taavi: yes. I'd like to do the deployment myself, give me a minute.
[13:02:14] sure, please ping me when done
[13:02:25] guerganaWMDE: Looks I did that by mistake :/
[13:02:29] mine is closing a wiki, just dblists/*
[13:02:37] no prob, i noticed on time :)
[13:03:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1004.eqiad.wmnet
[13:03:04] taavi: ok, will do. a bit of context: the patches have not been merged into master yet. I want to see that they work first. Testing on the debug host is critical in this case.
[13:03:44] (PS1) Ayounsi: netbox-next: enable poweroutlet validator [puppet] - https://gerrit.wikimedia.org/r/920690 (https://phabricator.wikimedia.org/T310590)
[13:03:46] (PS1) Ayounsi: Netbox prod: add poweroutlet validator [puppet] - https://gerrit.wikimedia.org/r/920691 (https://phabricator.wikimedia.org/T310590)
[13:06:22] !log daniel@deploy1002 Started scap: Backport for [[gerrit:920230|Revert "Revert "Add getMultiHttpClient function to make HTTP requests to Mathoid."" (T335347)]], [[gerrit:920231|Use MultiHttpClient instead of VirtualRESTService. (T335347)]]
[13:06:26] T335347: Introduce a new private method `getMultiHttpClient()` to replace `getServiceClient()` - https://phabricator.wikimedia.org/T335347
[13:06:38] (PS2) Slyngshede: sre.ganeti.makevm call reimage after VM creation [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491)
[13:06:59] (CR) Slyngshede: sre.ganeti.makevm call reimage after VM creation (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[13:07:38] (CR) Ayounsi: [C: +2] netbox-next: enable poweroutlet validator [puppet] - https://gerrit.wikimedia.org/r/920690 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[13:07:56] !log daniel@deploy1002 daniel: Backport for [[gerrit:920230|Revert "Revert "Add getMultiHttpClient function to make HTTP requests to Mathoid."" (T335347)]], [[gerrit:920231|Use MultiHttpClient instead of VirtualRESTService.
(T335347)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:08:47] (PS2) Ayounsi: Netbox prod: add poweroutlet validator [puppet] - https://gerrit.wikimedia.org/r/920691 (https://phabricator.wikimedia.org/T310590)
[13:08:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling reboot on A:aqs-canary
[13:09:10] (PS3) Ayounsi: Netbox prod: add poweroutlet validator [puppet] - https://gerrit.wikimedia.org/r/920691 (https://phabricator.wikimedia.org/T310590)
[13:09:12] (CR) CI reject: [V: -1] sre.ganeti.makevm call reimage after VM creation [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[13:09:55] (CR) Cathal Mooney: "LGTM, although in general any IPs assigned like this should be at least created in Netbox and set to 'reserved' before adding elsewhere. " [dns] - https://gerrit.wikimedia.org/r/919363 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[13:10:01] (CR) Cathal Mooney: [C: +1] wikimedia.cloud: add cloudservices200[4/5]-dev cloud-private address [dns] - https://gerrit.wikimedia.org/r/919363 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[13:11:21] I'll move my patch to tomorrow. It seems it won't get space in this backport window.
[13:12:10] (PS3) Slyngshede: sre.ganeti.makevm call reimage after VM creation [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491)
[13:12:28] taavi: i just verified on the debug hosts, looking good.
[13:13:38] SRE, Traffic, envoy, serviceops, Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (JMeybohm) Done in deployment-charts and puppet
[13:13:44] guerganaWMDE: your patches are up next - is it ok to do both at the same time?
[13:13:53] SRE, Traffic, envoy, serviceops, Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (JMeybohm) In progress→Resolved
[13:13:53] they are chained
[13:14:22] yes, i guess we can do them at the same time
[13:14:27] great
[13:15:04] (CR) Herron: [C: +1] cadvisor: add explicity metrics enable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/920660 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[13:15:29] kart_: sorry about that, looking at it now prod and repo were already in sync, just through an indirect path (I cherry-picked the patch, staged on mwdebug to test, deployed it, then +2'ed it). All that was missing is a git pull, but the pull would not actually bring in any new changes, it would merely fast forward to the identical commit.
[13:16:12] kart_: may I ask which command you used where it showed this as a problem?
[13:16:15] taavi: just lemme know which server to test with
[13:16:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1096.eqiad.wmnet
[13:17:08] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling reboot on P{aqs1011*} and A:aqs
[13:17:16] (CR) Herron: [C: +2] "thx for the review" [puppet] - https://gerrit.wikimedia.org/r/920368 (owner: Herron)
[13:17:20] RECOVERY - MariaDB Replica Lag: m2 on db2160 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:17:26] (PS2) Herron: logrotate: update description in override [puppet] - https://gerrit.wikimedia.org/r/920368
[13:17:28] (CR) Klausman: [C: +2] helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: Klausman)
[13:17:54] (PS3) Elukey: Add discovery configuration for k8s-ingress-ml-serve
[dns] - https://gerrit.wikimedia.org/r/920221 (https://phabricator.wikimedia.org/T336726)
[13:17:56] (PS3) Elukey: Add ores-legacy.discovery.wment configuration [dns] - https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726)
[13:18:14] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:920230|Revert "Revert "Add getMultiHttpClient function to make HTTP requests to Mathoid."" (T335347)]], [[gerrit:920231|Use MultiHttpClient instead of VirtualRESTService. (T335347)]] (duration: 11m 52s)
[13:18:18] T335347: Introduce a new private method `getMultiHttpClient()` to replace `getServiceClient()` - https://phabricator.wikimedia.org/T335347
[13:19:00] taavi: ok, deployment is complete. I'll poke at it on the live site a bit. Let's hope I didn't break anything :)
[13:19:11] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:19:12] thanks! continuing with the rest of the window
[13:19:16] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/920296 (https://phabricator.wikimedia.org/T336760) (owner: Guergana Tzatchkova)
[13:19:17] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/920306 (https://phabricator.wikimedia.org/T335099) (owner: Guergana Tzatchkova)
[13:20:06] (Merged) jenkins-bot: Enable wmgWikibaseTmpWbsubscribersSensibleOutput on wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/920296 (https://phabricator.wikimedia.org/T336760) (owner: Guergana Tzatchkova)
[13:20:09] (Merged) jenkins-bot: Enable wmgWikibaseTmpEnableLabelsInApiSummaries on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/920306 (https://phabricator.wikimedia.org/T335099) (owner: Guergana Tzatchkova)
[13:20:16] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:20:20] taavi do i get to test it first with Mediawiki debug? O:
[13:20:30] or are they deployed already?
[13:20:38] !log taavi@deploy1002 Started scap: Backport for [[gerrit:920296|Enable wmgWikibaseTmpWbsubscribersSensibleOutput on wikidata (T336760)]], [[gerrit:920306|Enable wmgWikibaseTmpEnableLabelsInApiSummaries on Wikidata (T335099)]]
[13:20:40] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (cmooney) >>! In T307357#8859005, @aborrero wrote: > With the remaining bits being: > * allocate a public IP...
[13:20:43] T335099: Wikidata: enable entity labels in parsed edit summaries in API requests - https://phabricator.wikimedia.org/T335099
[13:20:44] T336760: Wikidata: enable fix for `list=wbsubscribers` response format - https://phabricator.wikimedia.org/T336760
[13:20:54] guerganaWMDE: you'll be able to test it, don't worry.. just takes a while to pull it to debug servers
[13:21:06] ah ok. i was getting scared
[13:21:08] (CR) Elukey: "recheck" [dns] - https://gerrit.wikimedia.org/r/920221 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[13:22:04] !log btullis@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker
[13:22:09] !log taavi@deploy1002 gtzatchkova and taavi: Backport for [[gerrit:920296|Enable wmgWikibaseTmpWbsubscribersSensibleOutput on wikidata (T336760)]], [[gerrit:920306|Enable wmgWikibaseTmpEnableLabelsInApiSummaries on Wikidata (T335099)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:22:14] guerganaWMDE: please test
[13:22:15] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:22:18] oki, one sec
[13:23:16] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:23:22] (PS2) Majavah: plwiki: Show language selector in main page header [mediawiki-config] - https://gerrit.wikimedia.org/r/920396 (https://phabricator.wikimedia.org/T336707) (owner: Stang)
[13:23:44] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:24:10] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:24:10] taavi: both are working fine it seems! thanks!
[13:24:20] ok, syncing!
[13:24:25] (CR) Majavah: [C: +2] plwiki: Show language selector in main page header [mediawiki-config] - https://gerrit.wikimedia.org/r/920396 (https://phabricator.wikimedia.org/T336707) (owner: Stang)
[13:24:30] taavi: thank you :)
[13:24:35] Krinkle: sorry was afk. It was usual 'scap backport ..'
[13:25:08] (PS1) Elukey: services: change lift wing's kafka topic in changeprop's config [deployment-charts] - https://gerrit.wikimedia.org/r/920696 (https://phabricator.wikimedia.org/T333468)
[13:25:18] (Merged) jenkins-bot: plwiki: Show language selector in main page header [mediawiki-config] - https://gerrit.wikimedia.org/r/920396 (https://phabricator.wikimedia.org/T336707) (owner: Stang)
[13:25:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling reboot on P{aqs1011*} and A:aqs
[13:25:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1096.eqiad.wmnet
[13:25:34] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1097.eqiad.wmnet
[13:25:37] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling reboot on P{aqs102[0-1]*} and A:aqs
[13:26:32] kart_: interesting, I didn't realize we had comments that cared about the distinctin between origin/HEAD and remotes/origin/HEAD. I'll keep that in mind.
[13:26:36] commands*
[13:26:45] SRE, Infrastructure-Foundations, cloud-services-team, netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[13:26:53] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (cmooney)
[13:28:22] SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 3 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (cmooney) Resolved→Open @aborrero quick question on how best to set...
[13:29:07] SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 3 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (cmooney) a:Papaul→cmooney
[13:29:25] Krinkle: Thanks!
[13:29:53] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:920296|Enable wmgWikibaseTmpWbsubscribersSensibleOutput on wikidata (T336760)]], [[gerrit:920306|Enable wmgWikibaseTmpEnableLabelsInApiSummaries on Wikidata (T335099)]] (duration: 09m 15s)
[13:29:59] T335099: Wikidata: enable entity labels in parsed edit summaries in API requests - https://phabricator.wikimedia.org/T335099
[13:29:59] T336760: Wikidata: enable fix for `list=wbsubscribers` response format - https://phabricator.wikimedia.org/T336760
[13:30:29] !log taavi@deploy1002 Started scap: Backport for [[gerrit:920396|plwiki: Show language selector in main page header (T336707)]]
[13:30:33] T336707: Enable language button in main page header (plwiki) - https://phabricator.wikimedia.org/T336707
[13:32:01] !log taavi@deploy1002 stang and taavi: Backport for [[gerrit:920396|plwiki: Show language selector in main page header (T336707)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:32:15] koi: please test
[13:32:31] taavi, tested and LGTM
[13:32:41] thanks, syncing
[13:33:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1097.eqiad.wmnet
[13:33:24] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1098.eqiad.wmnet
[13:34:29] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Urgent: disk failed in ms-be1063 - https://phabricator.wikimedia.org/T336778 (Jclark-ctr) a:Jclark-ctr
[13:34:58] (PS5) Majavah: dblists: Close akwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/920244 (https://phabricator.wikimedia.org/T336675) (owner: MarcoAurelio)
[13:35:28] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336538 (Jhancock.wm) Open→Resolved
[13:36:26] (CR) Majavah: [C: +2] dblists: Close akwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/920244
(https://phabricator.wikimedia.org/T336675) (owner: MarcoAurelio)
[13:37:20] (Merged) jenkins-bot: dblists: Close akwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/920244 (https://phabricator.wikimedia.org/T336675) (owner: MarcoAurelio)
[13:37:37] (CR) Btullis: [C: +2] Update the version of airflow on an-test-client1001 [puppet] - https://gerrit.wikimedia.org/r/920689 (https://phabricator.wikimedia.org/T336286) (owner: Btullis)
[13:38:09] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:920396|plwiki: Show language selector in main page header (T336707)]] (duration: 07m 39s)
[13:38:13] T336707: Enable language button in main page header (plwiki) - https://phabricator.wikimedia.org/T336707
[13:38:47] !log taavi@deploy1002 Started scap: Backport for [[gerrit:920244|dblists: Close akwiki (T336675)]]
[13:38:52] T336675: Close ak.wikipedia - https://phabricator.wikimedia.org/T336675
[13:39:25] MatmaRex: your backport does not seem to have been merged into master yet
[13:40:15] !log taavi@deploy1002 taavi and maurelio: Backport for [[gerrit:920244|dblists: Close akwiki (T336675)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:40:19] hauskater: please test
[13:40:29] taavi: checking
[13:41:15] taavi: lgtm via ListGroupRights
[13:41:17] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1098.eqiad.wmnet
[13:41:19] thx, syncing
[13:41:20] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1099.eqiad.wmnet
[13:42:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling reboot on P{aqs102[0-1]*} and A:aqs
[13:42:52] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling reboot on P{aqs101[2-5]*} and A:aqs
[13:42:58] taavi: yeah, i'll poke someone to do it if it really does fix it in production
[13:43:14] taavi: or you
could merge it if you want
[13:43:54] in general the backport window policy is that only patches already merged into master can be backported...
[13:45:00] but this seems small enough
[13:45:04] (CR) Majavah: [C: +2] Define $maintClass in maintenance script for compatibility [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - https://gerrit.wikimedia.org/r/920582 (https://phabricator.wikimedia.org/T317375) (owner: Bartosz Dziewoński)
[13:46:02] (CR) Eevans: [C: +1] swift: remove ms-be204[0-3] from the rings [puppet] - https://gerrit.wikimedia.org/r/920645 (https://phabricator.wikimedia.org/T335280) (owner: MVernon)
[13:46:59] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:920244|dblists: Close akwiki (T336675)]] (duration: 08m 11s)
[13:47:03] T336675: Close ak.wikipedia - https://phabricator.wikimedia.org/T336675
[13:47:05] (CR) Ottomata: services: change lift wing's kafka topic in changeprop's config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/920696 (https://phabricator.wikimedia.org/T333468) (owner: Elukey)
[13:47:29] (PS1) Herron: mwlog: keep/reuse /srv filesystem across reimages [puppet] - https://gerrit.wikimedia.org/r/920698 (https://phabricator.wikimedia.org/T333614)
[13:48:06] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[13:48:28] taavi: o/ oo lots of patches this window! good luck! let me know when you are finished, I have a config patch to deploy
[13:48:50] ottomata: waiting for the CI for the last one!
[13:49:07] (PS2) Elukey: services: change lift wing's kafka topic in changeprop's config [deployment-charts] - https://gerrit.wikimedia.org/r/920696 (https://phabricator.wikimedia.org/T333468)
[13:49:12] (CR) Elukey: services: change lift wing's kafka topic in changeprop's config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/920696 (https://phabricator.wikimedia.org/T333468) (owner: Elukey)
[13:49:14] (PS2) Ottomata: page_content_change - Consume from mediawiki.page_change.v1 [deployment-charts] - https://gerrit.wikimedia.org/r/920379 (https://phabricator.wikimedia.org/T336817)
[13:49:30] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Docker
[13:49:37] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Urgent: disk failed in ms-be1063 - https://phabricator.wikimedia.org/T336778 (Jclark-ctr) Confirmed: Service Request 168322271 @MatthewVernon Replaced failed drive with spare drive on site when dell sends replacement will just add to spares
[13:49:41] (PS2) Herron: mwlog: keep/reuse /srv filesystem across reimages [puppet] - https://gerrit.wikimedia.org/r/920698 (https://phabricator.wikimedia.org/T333614)
[13:50:33] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1099.eqiad.wmnet
[13:50:35] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1100.eqiad.wmnet
[13:50:36] (PS3) Herron: mwlog: keep/reuse /srv filesystem across reimages [puppet] - https://gerrit.wikimedia.org/r/920698 (https://phabricator.wikimedia.org/T333614)
[13:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:51:40] (Merged) jenkins-bot: Define $maintClass in maintenance script for compatibility [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - https://gerrit.wikimedia.org/r/920582 (https://phabricator.wikimedia.org/T317375) (owner: Bartosz Dziewoński)
[13:52:25] !log taavi@deploy1002 Started scap: Backport for [[gerrit:920582|Define $maintClass in maintenance script for compatibility (T317375)]]
[13:52:29] T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375
[13:53:59] !log taavi@deploy1002 matmarex and taavi: Backport for [[gerrit:920582|Define $maintClass in maintenance script for compatibility (T317375)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:54:20] nothing to test, syncing
[13:54:30] thanks. i can't test this on mwdebug, it's just a maintenance script
[13:54:32] yeah
[13:56:59] (CR) Ssingh: [C: +1] Add discovery configuration for k8s-ingress-ml-serve [dns] - https://gerrit.wikimedia.org/r/920221 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[13:57:31] (CR) Elukey: [C: +2] Add discovery configuration for k8s-ingress-ml-serve [dns] - https://gerrit.wikimedia.org/r/920221 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[13:59:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1100.eqiad.wmnet
[13:59:46] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet
[13:59:49] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:920582|Define $maintClass in maintenance script for compatibility (T317375)]] (duration: 07m 24s)
[13:59:54] T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375
[14:01:49] MatmaRex:
https://phabricator.wikimedia.org/P48336
[14:02:11] (PS4) Ottomata: Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817)
[14:02:20] SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on analytics1068 - https://phabricator.wikimedia.org/T336826 (Jclark-ctr) a:Jclark-ctr
[14:02:38] taavi: thank you. that looks as expected
[14:02:48] ottomata: I'm done
[14:03:01] taavi: ty
[14:03:13] (CR) Ottomata: [C: +2] Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817) (owner: Ottomata)
[14:03:47] (CR) Bking: [C: +2] rdf-streaming-updater: use correct image path [deployment-charts] - https://gerrit.wikimedia.org/r/920304 (https://phabricator.wikimedia.org/T334244) (owner: Bking)
[14:04:09] (Merged) jenkins-bot: Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817) (owner: Ottomata)
[14:04:34] (Merged) jenkins-bot: rdf-streaming-updater: use correct image path [deployment-charts] - https://gerrit.wikimedia.org/r/920304 (https://phabricator.wikimedia.org/T334244) (owner: Bking)
[14:04:52] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Urgent: disk failed in ms-be1063 - https://phabricator.wikimedia.org/T336778 (Jclark-ctr) @MatthewVernon ms-be1063 is showing the drive in ready state and would need to be added to clear errors
[14:05:26] SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (RobH)
[14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:07:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1101.eqiad.wmnet
[14:07:44] SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on analytics1068 - https://phabricator.wikimedia.org/T336826 (Jclark-ctr) @wiki_willy Server is out of warranty i do not have any spare new drives. I do have disk from recently decommissioned servers we could possibly used
[14:08:25] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:08:31] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Urgent: disk failed in ms-be1063 - https://phabricator.wikimedia.org/T336778 (MatthewVernon) @Jclark-ctr thanks, I've added the new drive and it seems to be working fine. Thanks for the quick fix.
[14:09:17] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:09:23] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:10:53] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:11:31] (CR) Ssingh: [C: +1] Add ores-legacy.discovery.wment configuration [dns] - https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[14:12:08] SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (RobH)
[14:12:52] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Urgent: disk failed in ms-be1063 - https://phabricator.wikimedia.org/T336778 (Jclark-ctr) Open→Resolved
[14:12:54] SRE-swift-storage: Drain and then decommission ms-be10[40-43] - https://phabricator.wikimedia.org/T335281 (Jclark-ctr)
[14:12:56] SRE-swift-storage: Bring ms-be107[2-5] into the rings - https://phabricator.wikimedia.org/T335279 (Jclark-ctr)
[14:13:45] (Abandoned) Jdrewniak: Consolidate watchstar icon updating logic under watchstar.js [skins/Vector] (wmf/1.41.0-wmf.8) - https://gerrit.wikimedia.org/r/920241 (https://phabricator.wikimedia.org/T336640) (owner: Jdrewniak)
[14:14:04] (Abandoned) Jdrewniak: Ensure mw-watchlink is used for the sticky header watchlink [skins/Vector] (wmf/1.41.0-wmf.8) - https://gerrit.wikimedia.org/r/920239 (https://phabricator.wikimedia.org/T336640) (owner: Jdrewniak)
[14:14:39] !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: wgEventStreams - Declare mediawiki.page_change.v1 stream - T336817 (duration: 07m 30s)
[14:14:44] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817
[14:15:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling reboot on P{aqs101[2-5]*} and A:aqs
[14:15:35] (PS1) Muehlenhoff: Retire sre.aqs.roll-restart [cookbooks] - https://gerrit.wikimedia.org/r/920704 (https://phabricator.wikimedia.org/T330889)
[14:15:50] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling reboot on P{aqs101[6-9]*} and A:aqs
[14:16:24] (CR) Elukey: [C: +2] Add ores-legacy.discovery.wment configuration [dns] - https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:41] !log run authdns-update for new ml-serve/ores discovery endpoints - T336726
[14:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:45] T336726: Create k8s ingress config and VIP for ores-legacy - https://phabricator.wikimedia.org/T336726
[14:19:39] MatmaRex: hi, is wikimedia/portals.git still
consuming data from Module:Project portal/wikis @ meta ?
[14:19:52] or portals/deploy
[14:20:28] SRE, ops-eqiad, serviceops-collab, GitLab (Infrastructure): gitlab-runner1003 is not coming back online - https://phabricator.wikimedia.org/T336737 (Jclark-ctr) opened Service request with dell. Confirmed: Service Request 168325397 was successfully submitted. While waiting for response i have p...
[14:20:34] MatmaRex: it's not using the Module:Project data anymore, but I haven't deployed it for a couple of weeks (if it looks outdated)
[14:20:53] SRE, ops-eqiad, serviceops-collab, GitLab (Infrastructure): gitlab-runner1003 is not coming back online - https://phabricator.wikimedia.org/T336737 (Jclark-ctr) a:Jclark-ctr
[14:20:59] (PS3) Arturo Borrero Gonzalez: wikimedia.cloud: add cloudservices200[4/5]-dev cloud-private address [dns] - https://gerrit.wikimedia.org/r/919363 (https://phabricator.wikimedia.org/T307357)
[14:21:01] MatmaRex: wait you mean www.wikimedia.org?
[14:21:05] oh, sorry, I meant jan_drewniak
[14:21:38] jan_drewniak: we've just closed a couple of wikis and I wonder if we have to tag them as closed or the bot takes care of that
[14:23:14] RECOVERY - Host gitlab-runner1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[14:24:07] hauskater: ok just confirmed that it's using the portals/deploy git repo, not the Module. I can update the repo though. Which projects have closed?
[14:25:41] SRE, ops-eqiad, serviceops-collab, GitLab (Infrastructure): gitlab-runner1003 is not coming back online - https://phabricator.wikimedia.org/T336737 (Jclark-ctr) @Jelto Server has booted properly with no errors. Can you put server back in service. To see if error returns?
[14:25:49] SRE, ops-codfw, Traffic, Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (Jhancock.wm) @ssingh is it safe for me to physically remove lvs2008 and the three dns servers from the racks and offline them?
[14:26:18] hauskater: if they're individual language wikis then tag them as closed, and the update scripts should update the portals accordingly.
[14:26:21] jan_drewniak: but portals/deploy gets data from /portals right?
[14:26:57] SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Jhancock.wm)
[14:27:00] hauskater: yeah, it's very overly complicated tbh, but the script checks for the 'closed' tag.
[14:27:15] !log rolling restart of eventgate-main to pick up new mediawiki.page_change.v1 stream config - T336817
[14:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:20] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817
[14:27:24] jan_drewniak: yep, I'm trying to find where to add the tag
[14:27:32] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync
[14:27:48] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
[14:28:17] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[14:28:39] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[14:29:38] SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 3 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (aborrero) >>! In T336587#8859125, @cmooney wrote: > @aborrero quick questi...
[14:30:25] plus the docs still mention fetching data from meta, probably needs an update :) [14:30:27] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [14:30:48] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [14:31:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimedia.cloud: add cloudservices200[4/5]-dev cloud-private address [dns] - 10https://gerrit.wikimedia.org/r/919363 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [14:33:34] !log EventBus: produce to mediawiki.page_change.v1 stream - T336817 [14:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:39] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [14:34:24] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T336720 (10Jhancock.wm) 05Open→03Resolved same server as before. resolving [14:34:32] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@ad1cc7c]: deploying hotfix for T336800 [14:34:36] T336800: platform_eng Airflow instance Spark jobs failing after Iceberg changes - https://phabricator.wikimedia.org/T336800 [14:34:41] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@ad1cc7c]: deploying hotfix for T336800 (duration: 00m 09s) [14:35:22] (03CR) 10Muehlenhoff: sre.ganeti.makevm call reimage after VM creation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [14:36:12] (03CR) 10Ottomata: [C: 03+2] page_content_change - Consume from mediawiki.page_change.v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920379 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [14:36:25] !log installing jackson-databind security updates [14:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:51] (03Merged) 10jenkins-bot: page_content_change 
- Consume from mediawiki.page_change.v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920379 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [14:38:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:dse-k8s-worker [14:39:26] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10cmooney) @aborrero in general that makes sense yes. > if we are annoyed b... [14:39:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1027.eqiad.wmnet with reason: Maintenance [14:39:38] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams - EventBus: produce to mediawiki.page_change.v1 stream - T336817 (duration: 06m 20s) [14:39:42] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [14:39:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1027.eqiad.wmnet with reason: Maintenance [14:39:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1027 (T335845)', diff saved to https://phabricator.wikimedia.org/P48337 and previous config saved to /var/cache/conftool/dbconfig/20230517-143949-ladsgroup.json [14:40:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: Maintenance [14:40:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: Maintenance [14:40:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2028 (T335845)', diff saved to https://phabricator.wikimedia.org/P48338 and previous config saved to /var/cache/conftool/dbconfig/20230517-144025-ladsgroup.json [14:40:39] (03PS1) 10Elukey: utils: fix 
k8s-ingress-ml-serve discovery config [dns] - 10https://gerrit.wikimedia.org/r/920709 [14:41:20] (03CR) 10Ssingh: [C: 03+1] utils: fix k8s-ingress-ml-serve discovery config [dns] - 10https://gerrit.wikimedia.org/r/920709 (owner: 10Elukey) [14:41:27] (03CR) 10CI reject: [V: 04-1] utils: fix k8s-ingress-ml-serve discovery config [dns] - 10https://gerrit.wikimedia.org/r/920709 (owner: 10Elukey) [14:41:29] ha [14:42:32] (03PS2) 10Elukey: utils: fix k8s-ingress-ml-serve discovery config [dns] - 10https://gerrit.wikimedia.org/r/920709 [14:43:27] (03CR) 10CI reject: [V: 04-1] utils: fix k8s-ingress-ml-serve discovery config [dns] - 10https://gerrit.wikimedia.org/r/920709 (owner: 10Elukey) [14:44:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2028 (T335845)', diff saved to https://phabricator.wikimedia.org/P48339 and previous config saved to /var/cache/conftool/dbconfig/20230517-144425-ladsgroup.json [14:44:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1027 (T335845)', diff saved to https://phabricator.wikimedia.org/P48340 and previous config saved to /var/cache/conftool/dbconfig/20230517-144446-ladsgroup.json [14:45:31] (03PS1) 10Elukey: Revert "Add ores-legacy.discovery.wment configuration" [dns] - 10https://gerrit.wikimedia.org/r/920726 [14:46:33] (03PS2) 10Elukey: Revert "Add ores-legacy.discovery.wment configuration" [dns] - 10https://gerrit.wikimedia.org/r/920726 [14:48:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling reboot on P{aqs101[6-9]*} and A:aqs [14:48:32] (03CR) 10Elukey: [C: 03+2] Revert "Add ores-legacy.discovery.wment configuration" [dns] - 10https://gerrit.wikimedia.org/r/920726 (owner: 10Elukey) [14:48:34] (03CR) 10Ssingh: [C: 03+1] Revert "Add ores-legacy.discovery.wment configuration" [dns] - 10https://gerrit.wikimedia.org/r/920726 (owner: 10Elukey) [14:48:57] (03PS1) 10Elukey: Revert "Add discovery configuration for 
k8s-ingress-ml-serve" [dns] - 10https://gerrit.wikimedia.org/r/920727 [14:49:03] (03PS2) 10Elukey: Revert "Add discovery configuration for k8s-ingress-ml-serve" [dns] - 10https://gerrit.wikimedia.org/r/920727 [14:49:40] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [14:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:50:05] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh) >>! In T335777#8859472, @Jhancock.wm wrote: > @ssingh is it safe for me to physically remove lvs2008 and the three dns servers from the racks and offline t... [14:50:09] (03CR) 10Ssingh: [C: 03+1] Revert "Add discovery configuration for k8s-ingress-ml-serve" [dns] - 10https://gerrit.wikimedia.org/r/920727 (owner: 10Elukey) [14:50:15] (03CR) 10Elukey: [C: 03+2] Revert "Add discovery configuration for k8s-ingress-ml-serve" [dns] - 10https://gerrit.wikimedia.org/r/920727 (owner: 10Elukey) [14:50:46] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1115.eqiad.wmnet - https://phabricator.wikimedia.org/T336253 (10Jclark-ctr) [14:51:10] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1115.eqiad.wmnet - https://phabricator.wikimedia.org/T336253 (10Jclark-ctr) 05Open→03Resolved [14:51:18] arturo: o/ ok to authdns-update your change? [14:51:21] 10SRE, 10SRE-Access-Requests, 10Search-Console-access-request: Please grant scherukuwada@ access to wikisource.org in the Search Console - https://phabricator.wikimedia.org/T336500 (10Dzahn) Thanks a lot! 
sounds good:) [14:51:35] elukey: yes, sorry [14:51:52] elukey: I thought I had done it already [14:52:22] arturo: nono I broke authdns update (and now I am fixing it) so it wouldn't have worked :D [14:52:39] ok :-P [14:52:47] done :) [14:53:44] (03PS1) 10Hnowlan: rest-gateway: add citoid support [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) [14:54:55] (03PS1) 10Elukey: Add discovery configuration for k8s-ingress-ml-serve [dns] - 10https://gerrit.wikimedia.org/r/920712 (https://phabricator.wikimedia.org/T336726) [14:55:45] (03CR) 10CI reject: [V: 04-1] Add discovery configuration for k8s-ingress-ml-serve [dns] - 10https://gerrit.wikimedia.org/r/920712 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [14:56:37] (03CR) 10Ottomata: [C: 03+1] "mediawiki.page_change.v1 is ready to go with real data. GO FOR IT!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/920696 (https://phabricator.wikimedia.org/T333468) (owner: 10Elukey) [14:57:36] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Degraded RAID on analytics1068 - https://phabricator.wikimedia.org/T336826 (10wiki_willy) Thanks @Jclark-ctr. Feel free to pull the drives from a server that's already been decommissioned. >>! In T336826#8859354, @Jclark-ctr wrote: > @wiki_willy Server is out of warr... 
[14:59:32] (03PS2) 10Elukey: Add discovery configuration for k8s-ingress-ml-serve [dns] - 10https://gerrit.wikimedia.org/r/920712 (https://phabricator.wikimedia.org/T336726) [14:59:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2028', diff saved to https://phabricator.wikimedia.org/P48341 and previous config saved to /var/cache/conftool/dbconfig/20230517-145932-ladsgroup.json [14:59:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1027', diff saved to https://phabricator.wikimedia.org/P48342 and previous config saved to /var/cache/conftool/dbconfig/20230517-145952-ladsgroup.json [15:01:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet [15:01:45] (03CR) 10Ssingh: [C: 03+1] "Let's do it 😊" [dns] - 10https://gerrit.wikimedia.org/r/920712 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [15:02:16] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1123.eqiad.wmnet - https://phabricator.wikimedia.org/T334910 (10Jclark-ctr) [15:02:19] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1123.eqiad.wmnet - https://phabricator.wikimedia.org/T334910 (10Jclark-ctr) 05Open→03Resolved [15:03:01] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 (10Jclark-ctr) [15:03:07] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 (10Jclark-ctr) 05Open→03Resolved [15:03:52] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1113.eqiad.wmnet - https://phabricator.wikimedia.org/T336029 (10Jclark-ctr) [15:03:59] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1113.eqiad.wmnet - https://phabricator.wikimedia.org/T336029 (10Jclark-ctr) 05Open→03Resolved [15:06:16] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:13] (03CR) 10Btullis: [C: 03+1] Retire sre.aqs.roll-restart [cookbooks] - 10https://gerrit.wikimedia.org/r/920704 (https://phabricator.wikimedia.org/T330889) (owner: 10Muehlenhoff) [15:07:17] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:07:28] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:07:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zookeeper-test1002.eqiad.wmnet [15:07:48] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:04] (03PS5) 10ArielGlenn: Modify runtime of html dumps rsync to secondary host [puppet] - 10https://gerrit.wikimedia.org/r/919859 (https://phabricator.wikimedia.org/T335761) (owner: 10Hokwelum) [15:09:50] (03CR) 10ArielGlenn: [C: 03+2] Modify runtime of html dumps rsync to secondary host [puppet] - 10https://gerrit.wikimedia.org/r/919859 (https://phabricator.wikimedia.org/T335761) (owner: 10Hokwelum) [15:12:56] (03CR) 10Klausman: [C: 03+1] services: change lift wing's kafka topic in changeprop's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/920696 (https://phabricator.wikimedia.org/T333468) (owner: 10Elukey) [15:14:21] (03CR) 10Mvolz: "If I'm reading this correctly, this means we don't need https://gerrit.wikimedia.org/r/c/mediawiki/services/citoid/+/907481 ? 
I was kind o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [15:14:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2028', diff saved to https://phabricator.wikimedia.org/P48343 and previous config saved to /var/cache/conftool/dbconfig/20230517-151438-ladsgroup.json [15:14:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1002.wikimedia.org [15:14:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1027', diff saved to https://phabricator.wikimedia.org/P48344 and previous config saved to /var/cache/conftool/dbconfig/20230517-151458-ladsgroup.json [15:16:40] (03CR) 10Filippo Giunchedi: [C: 03+1] mwlog: keep/reuse /srv filesystem across reimages [puppet] - 10https://gerrit.wikimedia.org/r/920698 (https://phabricator.wikimedia.org/T333614) (owner: 10Herron) [15:18:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1002.wikimedia.org [15:19:11] (03CR) 10MVernon: [C: 03+2] swift: remove ms-be204[0-3] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/920645 (https://phabricator.wikimedia.org/T335280) (owner: 10MVernon) [15:21:15] (03CR) 10Andrea Denisse: [C: 03+1] mwlog: keep/reuse /srv filesystem across reimages [puppet] - 10https://gerrit.wikimedia.org/r/920698 (https://phabricator.wikimedia.org/T333614) (owner: 10Herron) [15:24:02] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:25:27] (03PS2) 10Hnowlan: rest-gateway: add citoid support [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) [15:25:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2002.wikimedia.org [15:25:40] (03CR) 10Hnowlan: rest-gateway: add citoid support (032 
comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [15:26:32] (03CR) 10Hnowlan: [C: 03+1] services: change lift wing's kafka topic in changeprop's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/920696 (https://phabricator.wikimedia.org/T333468) (owner: 10Elukey) [15:27:14] (03CR) 10Mvolz: rest-gateway: add citoid support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [15:29:18] RECOVERY - ircecho bot process on irc2002 is OK: PROCS OK: 1 process with command name python3, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [15:29:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2002.wikimedia.org [15:29:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2028 (T335845)', diff saved to https://phabricator.wikimedia.org/P48345 and previous config saved to /var/cache/conftool/dbconfig/20230517-152945-ladsgroup.json [15:29:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: Maintenance [15:30:02] (03CR) 10Elukey: [C: 03+2] Add discovery configuration for k8s-ingress-ml-serve [dns] - 10https://gerrit.wikimedia.org/r/920712 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [15:30:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: Maintenance [15:30:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1027 (T335845)', diff saved to https://phabricator.wikimedia.org/P48346 and previous config saved to /var/cache/conftool/dbconfig/20230517-153004-ladsgroup.json [15:30:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2032 (T335845)', diff saved to 
https://phabricator.wikimedia.org/P48347 and previous config saved to /var/cache/conftool/dbconfig/20230517-153010-ladsgroup.json [15:30:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1032.eqiad.wmnet with reason: Maintenance [15:30:36] (03CR) 10Kimberly Sarabia: Enable zebra ab test in hewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia) [15:30:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1032.eqiad.wmnet with reason: Maintenance [15:30:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1032 (T335845)', diff saved to https://phabricator.wikimedia.org/P48348 and previous config saved to /var/cache/conftool/dbconfig/20230517-153042-ladsgroup.json [15:30:51] (03PS4) 10Kimberly Sarabia: Enable zebra ab test in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) [15:32:36] (03PS1) 10Elukey: Revert "Revert "Add ores-legacy.discovery.wment configuration"" [dns] - 10https://gerrit.wikimedia.org/r/920729 [15:32:42] (03PS2) 10Elukey: Revert "Revert "Add ores-legacy.discovery.wment configuration"" [dns] - 10https://gerrit.wikimedia.org/r/920729 [15:33:39] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10darthmon_wmde) hi @KFrancis @Dzahn , my sincere apologies that I was MiA last week and did not reply to your request. Thanks @Dzahn for cr... 
[15:33:55] (03CR) 10Elukey: [C: 03+2] Revert "Revert "Add ores-legacy.discovery.wment configuration"" [dns] - 10https://gerrit.wikimedia.org/r/920729 (owner: 10Elukey) [15:34:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2032 (T335845)', diff saved to https://phabricator.wikimedia.org/P48349 and previous config saved to /var/cache/conftool/dbconfig/20230517-153410-ladsgroup.json [15:35:28] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) It's all good, thanks for the heads-up!:) sounds good [15:37:35] (03CR) 10Stevemunene: Bounce keyholder-proxy when keyholder-auth.d group -> key mapping changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779897 (owner: 10Ottomata) [15:38:09] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [15:38:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[15:39:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1032 (T335845)', diff saved to https://phabricator.wikimedia.org/P48350 and previous config saved to /var/cache/conftool/dbconfig/20230517-153925-ladsgroup.json [15:43:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:44:29] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable add link frontend in 9th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920722 (https://phabricator.wikimedia.org/T308134) [15:44:31] (03CR) 10Muehlenhoff: sre.ganeti.makevm call reimage after VM creation (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [15:46:42] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[15:48:01] (03PS2) 10BCornwall: pybal: Switch esams LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/917399 (https://phabricator.wikimedia.org/T263797) [15:48:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:48:40] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10hnowlan) 05Open→03Resolved a:03hnowlan [15:48:58] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10hnowlan) [15:49:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2032', diff saved to https://phabricator.wikimedia.org/P48351 and previous config saved to /var/cache/conftool/dbconfig/20230517-154916-ladsgroup.json [15:49:25] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10Jhancock.wm) [15:49:47] (03CR) 10Ssingh: [C: 03+1] pybal: Switch esams LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/917399 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [15:50:22] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10Jdforrester-WMF) [15:50:26] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41213/console" [puppet] - 10https://gerrit.wikimedia.org/r/917399 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [15:50:46] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply 
[15:52:05] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:52:14] !log Rolling out maglev LVS scheduler in esams - T263797 [15:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:18] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [15:53:02] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10hnowlan) [15:54:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1032', diff saved to https://phabricator.wikimedia.org/P48352 and previous config saved to /var/cache/conftool/dbconfig/20230517-155431-ladsgroup.json [15:54:32] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:54:51] (03CR) 10BCornwall: [V: 03+1 C: 03+2] pybal: Switch esams LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/917399 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [15:56:19] (03PS1) 10Ottomata: mediawiki-page-content-change-enrichment - bump image version to 1.16.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920723 [15:56:24] (03CR) 10Jaime Nuche: "PCC results: https://puppet-compiler.wmflabs.org/output/920669/41214/" [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [15:56:31] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [15:57:25] (03CR) 10Ottomata: [V: 03+2 C: 03+2] mediawiki-page-content-change-enrichment - bump image version to 1.16.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920723 (owner: 10Ottomata) [15:57:41] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:59:10] PROBLEM - BGP status on cr2-esams is CRITICAL: 
BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:00:01] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:00:10] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:01:38] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:01:50] ^ expected, Pybal upgrades in esams [16:02:04] PROBLEM - pybal on lvs3006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:02:07] (03PS2) 10Hokwelum: Add dump user subdirectories to support testing of new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915423 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [16:02:42] PROBLEM - PyBal backends health check on lvs3006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:03:18] PROBLEM - PyBal connections to etcd on lvs3006 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [16:03:32] (03PS9) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [16:03:36] (03PS1) 10Jelto: miscweb: add annualreport release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/920724 (https://phabricator.wikimedia.org/T300171) [16:04:04] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [16:04:23] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance es2032', diff saved to https://phabricator.wikimedia.org/P48353 and previous config saved to /var/cache/conftool/dbconfig/20230517-160423-ladsgroup.json [16:05:44] (03CR) 10Dzahn: [C: 03+1] miscweb: add annualreport release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/920724 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [16:05:46] (03CR) 10Hnowlan: rest-gateway: add citoid support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [16:08:03] (03CR) 10Jelto: [C: 03+2] miscweb: add annualreport release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/920724 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [16:08:28] (03PS1) 10AikoChou: ml-services: change isvc name to revertrisk-language-agnostic [deployment-charts] - 10https://gerrit.wikimedia.org/r/920725 (https://phabricator.wikimedia.org/T332998) [16:09:09] (03Merged) 10jenkins-bot: miscweb: add annualreport release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/920724 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [16:09:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1032', diff saved to https://phabricator.wikimedia.org/P48354 and previous config saved to /var/cache/conftool/dbconfig/20230517-160937-ladsgroup.json [16:13:38] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:14:33] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:17:55] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:18:23] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:18:32] (03PS2) 10ArielGlenn: add nfs tester to dumps worker (snapshot) testbed role [puppet] - 10https://gerrit.wikimedia.org/r/915437 
(https://phabricator.wikimedia.org/T325232) [16:19:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2032 (T335845)', diff saved to https://phabricator.wikimedia.org/P48355 and previous config saved to /var/cache/conftool/dbconfig/20230517-161929-ladsgroup.json [16:20:38] (03CR) 10Ayounsi: [C: 03+2] interface validator: workaround bug with count_ipaddresses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/920656 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [16:21:18] (03Merged) 10jenkins-bot: interface validator: workaround bug with count_ipaddresses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/920656 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [16:21:37] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox [16:24:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1032 (T335845)', diff saved to https://phabricator.wikimedia.org/P48356 and previous config saved to /var/cache/conftool/dbconfig/20230517-162444-ladsgroup.json [16:24:48] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:25:16] RECOVERY - pybal on lvs3006 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:25:28] RECOVERY - PyBal connections to etcd on lvs3006 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [16:25:54] RECOVERY - PyBal backends health check on lvs3006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:26:23] Reminder that pybal errors are normal during this rollout [16:28:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [16:37:51] (03CR) 10Dzahn: [C: 03+2] gerrit: remove gerrit1001 from 
.ssh/config [puppet] - 10https://gerrit.wikimedia.org/r/919403 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [16:38:26] (03CR) 10David Caro: [C: 03+2] toolforge_cli: add api gateway url and builds endpoint [puppet] - 10https://gerrit.wikimedia.org/r/918544 (owner: 10David Caro) [16:38:50] (03CR) 10Dzahn: [C: 03+2] gerrit: add gerrit1003 to hosts using KexAlgo ecdh-sha2-nistp521 for ssh [puppet] - 10https://gerrit.wikimedia.org/r/919402 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:39:31] dcaro: you can type "multiple" or I can [16:39:40] but one of us has to.. it can't separate them in this case [16:40:09] sometimes it can [16:40:33] mutante: done :), sorry for the delay, irccloud UI did not let me see the channel for some reason (shift + r worked though) [16:41:30] dcaro: all good:) thanks [16:41:45] going ahead with puppet runs for mine [16:42:18] (03PS2) 10Dzahn: gerrit: remove gerrit1001 from .ssh/config [puppet] - 10https://gerrit.wikimedia.org/r/919403 (https://phabricator.wikimedia.org/T336427) [16:42:40] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:44:56] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:45:18] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:45:50] PROBLEM - pybal on lvs3005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:47:44] (03PS4) 10Dzahn: gerrit: remove gerrit1001 from ssh_allowed hosts and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/919401 (https://phabricator.wikimedia.org/T336427) [16:49:00] PROBLEM - PyBal connections to etcd on 
lvs3005 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [16:50:28] (03PS5) 10Dzahn: gerrit: remove gerrit1001 from acme_chief, ssh known_hosts and firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/919401 (https://phabricator.wikimedia.org/T336427) [16:51:04] (03CR) 10Dzahn: [C: 03+2] gerrit: remove gerrit1001 from acme_chief, ssh known_hosts and firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/919401 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [16:51:39] (03CR) 10Dzahn: [C: 03+2] "this makes it even safer, now gerrit1001 cant talk to the other hosts on network level - ssh is blocked" [puppet] - 10https://gerrit.wikimedia.org/r/919401 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [16:56:08] (03CR) 10Dzahn: [C: 03+1] "we said today in meeting that gerrit service could already be removed and it's stopped permanently.. so it's very very unlikely we have to" [puppet] - 10https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [16:58:04] (03CR) 10Dzahn: [C: 03+1] "you can look up this IP in DNS and get "gerrit-old", fwiw." [puppet] - 10https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [16:58:50] !log Running `foreachwiki extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --video --mime=video/mpeg --missing --error --stalled --throttle` on mwmaint1002 for T244570 [16:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:54] (03PS1) 10Dzahn: site: remove gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/920749 [16:58:54] T244570: Fix missing thumbnail for MPEG (.mpg) videos - https://phabricator.wikimedia.org/T244570 [16:58:56] Ha. 
[16:59:51] (03PS2) 10Dzahn: site: remove gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/920749 (https://phabricator.wikimedia.org/T336427) [17:00:01] (03CR) 10Dzahn: [V: 04-1 C: 04-1] site: remove gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/920749 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T1700) [17:00:55] (03PS1) 10David Caro: toolforge: fix typo in config key [puppet] - 10https://gerrit.wikimedia.org/r/920750 [17:00:55] mutante: I am sure you remember but we should also remove it from homer:definitions/static.net. happy to take care of that + rollout if desired [17:01:04] (03CR) 10David Caro: [C: 03+2] toolforge: fix typo in config key [puppet] - 10https://gerrit.wikimedia.org/r/920750 (owner: 10David Caro) [17:01:24] (03PS2) 10David Caro: toolforge: fix typo in config key [puppet] - 10https://gerrit.wikimedia.org/r/920750 [17:02:26] (03CR) 10David Caro: [C: 03+2] toolforge: fix typo in config key [puppet] - 10https://gerrit.wikimedia.org/r/920750 (owner: 10David Caro) [17:03:32] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:03:54] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:04:26] RECOVERY - pybal on lvs3005 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:05:38] RECOVERY - PyBal connections to etcd on lvs3005 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:06:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:06] (03CR) 10Hokwelum: [C: 03+1] "checks out :-)" [puppet] - 10https://gerrit.wikimedia.org/r/915437 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [17:07:33] (03CR) 10Hokwelum: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/915423 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [17:07:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:40] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q4), 10Sustainability (Incident Followup): Alert when no data is received from Prometheus in a certain amount of time - https://phabricator.wikimedia.org/T336448 (10lmata) [17:12:11] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q4), 10Sustainability (Incident Followup): Alert when no data is received from Prometheus in a certain amount of time - https://phabricator.wikimedia.org/T336448 (10lmata) p:05Triage→03Medium [17:16:31] (03PS1) 10Andrew Bogott: envscripts: include OS_CLOUD in environment. [puppet] - 10https://gerrit.wikimedia.org/r/920753 [17:17:07] (03PS1) 10Cathal Mooney: Don't set routed sub-int parents as L2 trunks on switches [homer/public] - 10https://gerrit.wikimedia.org/r/920754 (https://phabricator.wikimedia.org/T296832) [17:19:00] !log Maglev LVS scheduler rollout finished in esams - T263797 [17:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:05] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [17:19:47] (03CR) 10Cathal Mooney: "Self-merging so homer and switch port cookbook runs cleanly for now." 
[homer/public] - 10https://gerrit.wikimedia.org/r/920754 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [17:19:51] (03CR) 10Cathal Mooney: [C: 03+2] Don't set routed sub-int parents as L2 trunks on switches [homer/public] - 10https://gerrit.wikimedia.org/r/920754 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [17:20:31] (03Merged) 10jenkins-bot: Don't set routed sub-int parents as L2 trunks on switches [homer/public] - 10https://gerrit.wikimedia.org/r/920754 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [17:25:08] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10cmooney) @aborrero set up now, let me know if this looks correct: [clouds... [17:35:08] jan_drewniak: I give up :-) npm / gulp refuses to work for me re portals :) [17:35:30] and gulp's broken I think [17:41:54] 10SRE, 10ops-eqsin, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus5001 - https://phabricator.wikimedia.org/T335587 (10RobH) 05Open→03Declined So it appears this was created as a #decommission task for a ganeti vm, which isn't needed, but I can underst... 
[17:42:16] (03CR) 10Stevemunene: [V: 03+1 C: 03+2] Bring stat1009 into service [puppet] - 10https://gerrit.wikimedia.org/r/919826 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [17:42:38] (03Abandoned) 10Ssingh: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/914343 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [17:43:12] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: sync [17:43:15] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: sync [17:44:26] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [17:44:27] (03PS1) 10Bartosz Dziewoński: Define $maintClass in maintenance script for compatibility [extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920731 (https://phabricator.wikimedia.org/T317375) [17:44:28] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [17:44:38] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [17:44:43] (03PS1) 10Bartosz Dziewoński: NewTopicOptOutActiveUsers: Skip bot users etc. [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920732 (https://phabricator.wikimedia.org/T317375) [17:44:45] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [17:45:06] (03PS1) 10Bartosz Dziewoński: NewTopicOptOutActiveUsers: Skip bot users etc. 
[extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920733 (https://phabricator.wikimedia.org/T317375) [17:50:42] (03PS1) 10Hnowlan: wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T335361) [17:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:51:35] (03PS2) 10Hnowlan: wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) [17:57:02] (03CR) 10CI reject: [V: 04-1] wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [18:00:05] dancy and hashar: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T1800). [18:00:05] dancy, hashar, and brennen: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T1800). 
[18:00:20] o/ [18:02:15] !log otto@deploy1002 Started deploy [analytics/refinery@fb22795]: Deploy for ProduceCanaryEvents fix - [analytics/refinery@fb22795] [18:03:36] !log train 1.41.0-wmf.9 (T330215): no current blockers, rolling to group1 as backup-backup conductor [18:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:40] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [18:03:46] (03CR) 10Ottomata: [C: 03+2] Create mediawiki-page-content-change-enrichment namespaces in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/920382 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [18:03:50] (03PS1) 10Dzahn: extdist: switch git URLs from gerrit-replica to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) [18:04:44] hrm. wikiversions state is a bit odd here. [18:05:28] (03CR) 10Dzahn: "as Antoine pointed out on IRC after our meeting.. they actually DO use the gerrit-replica for some things (and dont for others). this is t" [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:05:35] Thanks brennen. [18:05:38] What's up w/ wikiversions? [18:05:48] just shows we've got some of group0 on the old version [18:05:58] (03Merged) 10jenkins-bot: Create mediawiki-page-content-change-enrichment namespaces in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/920382 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [18:06:14] hmm.. that's worth debugging. 
[18:06:18] yeah [18:06:35] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:40] taking a look [18:06:41] $ grep -c 'wmf.9' ./wikiversions.json [18:06:43] 134 [18:10:33] (03PS6) 10Gmodena: mediawiki-page-content-change-enrichment: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) [18:11:30] !log otto@deploy1002 Finished deploy [analytics/refinery@fb22795]: Deploy for ProduceCanaryEvents fix - [analytics/refinery@fb22795] (duration: 09m 14s) [18:12:48] (03PS1) 10Ottomata: Bump refinery version for produce_canary_events job [puppet] - 10https://gerrit.wikimedia.org/r/920762 (https://phabricator.wikimedia.org/T330236) [18:14:44] (03PS2) 10Slyngshede: Offboarding: Allow managers to offboard users. [software/bitu] - 10https://gerrit.wikimedia.org/r/920665 (https://phabricator.wikimedia.org/T335476) [18:16:07] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:27] (03CR) 10Ottomata: mediawiki-page-content-change-enrichment: enable HA (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [18:16:48] (03CR) 10Ottomata: [C: 03+2] Bump refinery version for produce_canary_events job [puppet] - 10https://gerrit.wikimedia.org/r/920762 (https://phabricator.wikimedia.org/T330236) (owner: 10Ottomata) [18:17:27] dancy: looks like just akwiki is missing [18:17:33] huh [18:17:51] recent addition? i don't know the language code offhand [18:18:42] https://phabricator.wikimedia.org/rOMWC424def8b8903dd159c3f64a533238fbe58d30105 [18:18:58] ah, moved to group0 as closed [18:19:03] ok, i think it's safe to proceed here. [18:19:14] whew! 
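[editor's note] The wikiversions check above used `grep -c`, which counts matching lines. A minimal Python sketch of the same idea follows. It assumes wikiversions.json is a flat JSON object mapping a wiki database name to a version string such as "php-1.41.0-wmf.9". The function names, the sample database names and the sample data are hypothetical and are not taken from the log.

```python
import json
from collections import Counter


def count_versions(wikiversions_text):
    """Return a Counter mapping each version string to the number of wikis on it."""
    mapping = json.loads(wikiversions_text)
    return Counter(mapping.values())


def wikis_not_on(wikiversions_text, dbnames, version_suffix):
    """Return the sorted dbnames whose version does not end with version_suffix.

    A dbname that is absent from the mapping is also reported.
    """
    mapping = json.loads(wikiversions_text)
    return sorted(
        db for db in dbnames
        if not mapping.get(db, "").endswith(version_suffix)
    )


# Made-up sample data (assumption: real files use the same flat layout).
SAMPLE = json.dumps({
    "aawiki": "php-1.41.0-wmf.9",
    "bbwiki": "php-1.41.0-wmf.9",
    "ccwiki": "php-1.41.0-wmf.8",
})

if __name__ == "__main__":
    for version, total in sorted(count_versions(SAMPLE).items()):
        print(version, total)
    print(wikis_not_on(SAMPLE, ["aawiki", "bbwiki", "ccwiki"], "wmf.9"))
```

Counting parsed values rather than matching lines avoids depending on how the file happens to be line-wrapped, and the second helper names the stragglers directly instead of leaving them to be inferred from a total.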
[18:19:35] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920763 (https://phabricator.wikimedia.org/T330215) [18:19:37] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920763 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [18:20:18] (03CR) 10Gmodena: mediawiki-page-content-change-enrichment: enable HA (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [18:20:21] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920763 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [18:22:45] (03PS1) 10Ottomata: produce_canary_events - fix refinery-job to use shaded jar [puppet] - 10https://gerrit.wikimedia.org/r/920764 (https://phabricator.wikimedia.org/T330236) [18:22:55] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:15] (03PS2) 10Ottomata: produce_canary_events - fix refinery-job to use shaded jar [puppet] - 10https://gerrit.wikimedia.org/r/920764 (https://phabricator.wikimedia.org/T330236) [18:24:36] (03CR) 10Ottomata: [V: 03+2 C: 03+2] produce_canary_events - fix refinery-job to use shaded jar [puppet] - 10https://gerrit.wikimedia.org/r/920764 (https://phabricator.wikimedia.org/T330236) (owner: 10Ottomata) [18:25:37] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech, 10wmde-wikidata-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (10Ladsgroup) >>! In T331356#8853449, @OlafJanssen wrote: > The sidebar is just one place the concept URI appears, how about all the concept... 
[18:27:44] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.9 refs T330215 [18:27:49] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [18:29:49] (03PS1) 10Dzahn: gerrit: make new lfs path the default and clean up [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) [18:30:25] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:07] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.9 refs T330215 (duration: 06m 22s) [18:34:12] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [18:34:13] (03CR) 10Dzahn: "we said on IRC we might try this but not today before holiday in Europe, this is for Monday or so.. next week but before Thursday when we " [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:34:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:57] (03CR) 10Dzahn: "Amir, do you know this one?" [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:36:46] (03Abandoned) 10Dzahn: site: remove gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/920749 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:37:01] taavi: puppet is erroring out on new vms in deployment-prep, and I think this is related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/907940. How is /etc/ssh/ca-key-id.txt meant to be created? 
[18:37:31] [ 209.654805] cloud-init[1403]: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, No such file or directory @ rb_sysopen - /etc/ssh/ca-key-id.txt (file: /etc/puppet/modules/ssh/manifests/server/ca_signed_hostkey.pp, line: 25, column: 18) (file: [18:37:31] /etc/puppet/modules/ssh/manifests/server.pp, line: 92) on node deployment-cassandra01.deployment-prep.eqiad1.wikimedia.cloud [18:37:35] (03CR) 10Ladsgroup: extdist: switch git URLs from gerrit-replica to gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:37:52] (03CR) 10Dzahn: [C: 04-2] "on hold for now" [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:38:11] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:38:54] (03CR) 10Dzahn: "thanks! yea, more just wondering how much load this could possibly be.. but hard to tell" [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:39:35] urandom: it's created by puppet on the puppetmasters. 
unfortunately can't look into it more atm, I have a very early flight tomorrow morning to the hackathon [18:39:57] (03PS1) 10Ottomata: produce_canary_events - revert bump of refinery-job version [puppet] - 10https://gerrit.wikimedia.org/r/920767 (https://phabricator.wikimedia.org/T330236) [18:40:08] (03CR) 10Ottomata: [V: 03+2 C: 03+2] produce_canary_events - revert bump of refinery-job version [puppet] - 10https://gerrit.wikimedia.org/r/920767 (https://phabricator.wikimedia.org/T330236) (owner: 10Ottomata) [18:41:03] (03PS1) 10Ladsgroup: Remove db1112 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/920768 (https://phabricator.wikimedia.org/T336332) [18:41:45] (03CR) 10Dzahn: "I hear this is used on canary hosts (originally I thought this as a different script only used by humans). Does serviceops think it's ok t" [puppet] - 10https://gerrit.wikimedia.org/r/919365 (https://phabricator.wikimedia.org/T216380) (owner: 10Dzahn) [18:43:11] (03CR) 10Dzahn: "@Jelto based on your previous comment and ultimately our meetin today where we deployed this already.. I assume I can just abandon this. 
I" [deployment-charts] - 10https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [18:43:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1112.eqiad.wmnet [18:43:30] (03Abandoned) 10Dzahn: add 15.wikipedia to cert and gateway hosts for miscweb behind istio ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [18:44:23] (03CR) 10Dzahn: "we will get back to this when you are back from Athens and we do the actual switch to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/761060 (owner: 10Dzahn) [18:44:58] (03CR) 10Dzahn: "we will do this soon, after the Hackathon week though" [puppet] - 10https://gerrit.wikimedia.org/r/761062 (owner: 10Dzahn) [18:45:21] (03CR) 10Ladsgroup: extdist: switch git URLs from gerrit-replica to gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:45:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:57] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:47:58] (03CR) 10Ahmon Dancy: [C: 03+1] logstash_checker.py: remove trusty-specific hacks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919365 (https://phabricator.wikimedia.org/T216380) (owner: 10Dzahn) [18:48:17] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [18:49:03] (03PS2) 10Andrew Bogott: envscripts: include OS_CLOUD in environment. 
[puppet] - 10https://gerrit.wikimedia.org/r/920753 [18:49:05] (03PS6) 10Andrew Bogott: cloudservices: codfw1dev: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/919352 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [18:49:10] (03PS1) 10Ladsgroup: Add add_user_is_temp_T336886.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/920771 (https://phabricator.wikimedia.org/T336886) [18:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:50:07] (03CR) 10Andrew Bogott: [C: 03+2] envscripts: include OS_CLOUD in environment. [puppet] - 10https://gerrit.wikimedia.org/r/920753 (owner: 10Andrew Bogott) [18:50:29] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices: codfw1dev: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/919352 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [18:53:45] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:03] (03PS1) 10Dzahn: gerrit2002: mask gerrit service [puppet] - 10https://gerrit.wikimedia.org/r/920773 (https://phabricator.wikimedia.org/T334521) [18:57:51] (03CR) 10Dzahn: [C: 04-1] "to be merged on Thursday (or before?)" [puppet] - 10https://gerrit.wikimedia.org/r/920773 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:58:14] !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1112.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [18:59:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1112.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [18:59:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:59:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1112.eqiad.wmnet [18:59:49] (03CR) 10Dzahn: [C: 03+1] "re: my previous comment. Today we have said in meeting the LFS data does not have to be copied to the replica host. So that means I actual" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [19:00:33] (03CR) 10Ladsgroup: [C: 03+2] Remove db1112 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/920768 (https://phabricator.wikimedia.org/T336332) (owner: 10Ladsgroup) [19:00:37] (03PS2) 10Ladsgroup: Remove db1112 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/920768 (https://phabricator.wikimedia.org/T336332) [19:00:39] (03CR) 10Ladsgroup: [V: 03+2] Remove db1112 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/920768 (https://phabricator.wikimedia.org/T336332) (owner: 10Ladsgroup) [19:01:34] !log Removing db1112 from zarcillo T336332 [19:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:39] T336332: decommission db1112.eqiad.wmnet - https://phabricator.wikimedia.org/T336332 [19:02:07] (03CR) 10Dzahn: [C: 03+1] "we ended up with this being a one-liner due to rebase and some duplicate work. this could now be as well part of https://gerrit.wikimedia." 
[puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [19:03:18] (03CR) 10Dzahn: "when this is merged also do https://gerrit.wikimedia.org/r/c/operations/puppet/+/908617 unless it's already done by then" [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [19:03:36] 10ops-eqiad, 10decommission-hardware: decommission db1112.eqiad.wmnet - https://phabricator.wikimedia.org/T336332 (10Ladsgroup) a:05Ladsgroup→03wiki_willy [19:06:09] (03PS7) 10Gmodena: mediawiki-page-content-change-enrichment: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) [19:11:25] (03CR) 10Dzahn: "so, it's definitely a good change (that infra security will also like) that we move the system user ID to this 9xx range. But it also need" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [19:13:53] mutante: I'd like to fix the UID first, that eases the switches 😃 [19:16:21] hashar: I thought it was easier to not have to touch anything on prod server and just fix the permissions one time during maintenance and then be done. 
but let's discuss again, after gerrit-replica is out of the way [19:17:00] (03PS10) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [19:17:25] (03PS1) 10Andrew Bogott: Revert "cloudservices: codfw1dev: enable cloud-private subnet" [puppet] - 10https://gerrit.wikimedia.org/r/920734 [19:17:35] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:17:44] I think that zuul one is straight forward :) [19:17:49] (03CR) 10CI reject: [V: 04-1] Revert "cloudservices: codfw1dev: enable cloud-private subnet" [puppet] - 10https://gerrit.wikimedia.org/r/920734 (owner: 10Andrew Bogott) [19:19:57] (03CR) 10Dzahn: "Do you still want to upgrade PHP version on contint buster hosts before we switch to bullseye hosts at this point? Do you see a real advan" [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [19:20:09] hashar: sounds good:) [19:20:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:25:56] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:36] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:07] (03CR) 10Krinkle: arclamp: switch redis server to arclamp1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920299 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) 
[19:36:53] (03PS1) 10Dwisehaupt: Shift frbast names to using the new hosts [dns] - 10https://gerrit.wikimedia.org/r/920777 (https://phabricator.wikimedia.org/T334505) [19:38:08] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:38:12] (03CR) 10CI reject: [V: 04-1] Shift frbast names to using the new hosts [dns] - 10https://gerrit.wikimedia.org/r/920777 (https://phabricator.wikimedia.org/T334505) (owner: 10Dwisehaupt) [19:38:47] (03CR) 10Krinkle: [C: 04-1] Remove innodb_lock_wait_timeout from the DatabaseMysqli SET statement in open() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918612 (owner: 10Aaron Schulz) [19:41:21] !log bking@wdqs2012 depooling to attempt firmware update T331297 [19:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:26] T331297: Audit/update NIC firmware on Search Platform-owned Buster hosts - https://phabricator.wikimedia.org/T331297 [19:41:30] (03PS2) 10Herron: arclamp: switch redis server to arclamp1001 [puppet] - 10https://gerrit.wikimedia.org/r/920299 (https://phabricator.wikimedia.org/T327277) [19:42:06] (03PS1) 10Ottomata: staging - Move flink-operator values to release specific values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/920778 (https://phabricator.wikimedia.org/T330507) [19:42:52] (03PS1) 10Dwisehaupt: Shift frbast names to using the new hosts [dns] - 10https://gerrit.wikimedia.org/r/920779 (https://phabricator.wikimedia.org/T334505) [19:43:41] (03Abandoned) 10Dwisehaupt: Shift frbast names to using the new hosts [dns] - 10https://gerrit.wikimedia.org/r/920777 (https://phabricator.wikimedia.org/T334505) (owner: 10Dwisehaupt) [19:44:35] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41222/console" [puppet] - 
10https://gerrit.wikimedia.org/r/920299 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [19:45:14] (03CR) 10Jgreen: [C: 03+2] Shift frbast names to using the new hosts [dns] - 10https://gerrit.wikimedia.org/r/920779 (https://phabricator.wikimedia.org/T334505) (owner: 10Dwisehaupt) [19:45:48] (03CR) 10Ottomata: [C: 03+2] staging - Move flink-operator values to release specific values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/920778 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [19:45:58] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:27] (03CR) 10Herron: arclamp: switch redis server to arclamp1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920299 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [19:47:59] (03Merged) 10jenkins-bot: staging - Move flink-operator values to release specific values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/920778 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [19:51:19] (03CR) 10CDanis: [C: 03+2] Replace my production ssh key with an ed25519 one. 
[puppet] - 10https://gerrit.wikimedia.org/r/920312 (https://phabricator.wikimedia.org/T336776) (owner: 10Jgreen) [19:52:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Change production access ssh key for Jeff Green / jgreen - https://phabricator.wikimedia.org/T336776 (10CDanis) 05Open→03Resolved a:03CDanis Will be live in half an hour [19:53:48] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:53:54] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10CDanis) [19:54:13] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs2012.codfw.wmnet [19:54:23] !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [19:55:32] !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [19:56:41] (03PS1) 10CDanis: manuel-wmde ssh access & analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/920780 (https://phabricator.wikimedia.org/T336841) [19:57:50] (03CR) 10CDanis: [C: 03+2] manuel-wmde ssh access & analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/920780 (https://phabricator.wikimedia.org/T336841) (owner: 10CDanis) [19:58:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10CDanis) 05Open→03Resolved a:03CDanis Will be live in half an hour. Please re-open the task if you have any trouble with SSH access! [20:00:05] (03PS1) 10DCausse: flink-session-cluster: fix prom reporter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/920781 (https://phabricator.wikimedia.org/T336872) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: That opportune time is upon us again. Time for a UTC late backport window deploy. 
Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T2000). [20:00:06] jan_drewniak, sergi0, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:06] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:17] Hello [20:00:20] hi [20:00:33] Hi, I can deploy today [20:01:40] (03PS2) 10Urbanecm: GrowthExperiments: enable add link frontend in 9th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920722 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:01:43] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: enable add link frontend in 9th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920722 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:02:00] (03CR) 10Urbanecm: [C: 03+2] NewTopicOptOutActiveUsers: Skip bot users etc. [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920732 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [20:02:14] MatmaRex: do you want me to do the .9 backport too? or only the .8? 
[20:02:27] (03Merged) 10jenkins-bot: GrowthExperiments: enable add link frontend in 9th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920722 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:02:54] urbanecm: we can do .9 as well, but i didn't want to take up the entire window https://gerrit.wikimedia.org/r/q/project:mediawiki/extensions/DiscussionTools+branch:wmf/1.41.0-wmf.9+status:open [20:03:01] O/ [20:03:09] (and i don't need them right now, it'd just be nice for consistency) [20:03:10] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:920722|GrowthExperiments: enable add link frontend in 9th round wikis (T308134)]] [20:03:11] (03PS1) 10Ottomata: mediawiki-page-content-change-enrichment - Revert deploy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/920782 (https://phabricator.wikimedia.org/T330507) [20:03:15] T308134: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 [20:03:24] i can do them at once, and it won't take a lot of additional time, so I'll go for it too. [20:03:25] 10SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (10CDanis) >>! In T336701#8852464, @Dzahn wrote: > Thank you! Both URLs work for me and appear to be valid feeds :) Should they bee added to planet, btw? Seems reasonable to me. [20:03:27] thanks [20:03:32] (03CR) 10Urbanecm: [C: 03+2] NewTopicOptOutActiveUsers: Skip bot users etc. 
[extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920733 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [20:03:53] ...or maybe not, as it depends on other non-merged patches [20:03:56] MatmaRex: ^^ [20:04:46] !log urbanecm@deploy1002 sgimeno and urbanecm: Backport for [[gerrit:920722|GrowthExperiments: enable add link frontend in 9th round wikis (T308134)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:05:01] hi sergi0: your patch is available at mwdebug1002. can you test please? [20:05:11] testing now [20:06:12] (03PS5) 10Urbanecm: Enable zebra ab test in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia) [20:06:20] urbanecm: hey, sorry I was running late, I'm here for my deploy though [20:06:39] jan_drewniak: hey, thanks for the info! I'll ping you once your patch is ready. [20:06:45] urbanecm: yes, that's what i meant, those 3 wmf.9 patches are already in wmf.8 and master [20:06:53] ah, makes sense [20:07:05] we can backport them for consistency, but it's not needed [20:07:09] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs2012.codfw.wmnet [20:07:13] okay [20:07:14] noted [20:07:24] (03Merged) 10jenkins-bot: NewTopicOptOutActiveUsers: Skip bot users etc. 
[extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920732 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [20:07:56] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:06] (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment - Revert deploy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/920782 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [20:08:08] (03CR) 10Sergio Gimeno: GrowthExperiments: enable add link frontend in 9th round wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920722 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:08:48] urbanecm: tested 4 wikis (hif, gor, ilo, jam), things looking good, however I just noticed the wrong jbo entry in the patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/920722/2/wmf-config/InitialiseSettings.php#12445. I'm gonna amend that now, I am not sure if it's harmful. [20:09:12] sergi0: afaik that should enable the frontend for _all_ projects in the jbo language. yes please, a followup would be helpful. [20:09:18] and sorry for missing this :) [20:09:30] deploying in the meantime!
[20:10:16] (03Merged) 10jenkins-bot: mediawiki-page-content-change-enrichment - Revert deploy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/920782 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [20:12:15] (03PS1) 10Sergio Gimeno: GrowthExperiments: amend wrong wiki prefix for jbowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920784 (https://phabricator.wikimedia.org/T308134) [20:12:26] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs2012.codfw.wmnet [20:12:29] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920784 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:12:34] urbanecm: hey I'm sitting in for Jan. Feel free to ping me when our deploy is ready. [20:13:08] kimberly_sarabia: okay, will do! and nice to meet you Kimberly! [20:13:24] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs2012.codfw.wmnet [20:13:42] MatmaRex: i assume the backport's not testable, right? [20:13:48] (03CR) 10Urbanecm: [C: 03+2] Enable zebra ab test in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia) [20:13:57] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: amend wrong wiki prefix for jbowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920784 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:13:59] urbanecm: luckily the entry was set to false so it should not enable anything. Sorry for that.
[20:14:04] no worries [20:14:08] i'll deploy the fix [20:14:10] urbanecm: nope, it's just a maintenance script [20:14:20] ack ack [20:14:21] (03CR) 10Hashar: "I have checked the Apache log from https://logstash.wikimedia.org/goto/d09f150e63158c05d59e939def231b64 for the last 24 hours we had:" [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [20:14:36] (03Merged) 10jenkins-bot: Enable zebra ab test in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia) [20:15:17] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:920722|GrowthExperiments: enable add link frontend in 9th round wikis (T308134)]] (duration: 12m 06s) [20:15:21] T308134: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 [20:15:24] sergi0: first patch's live [20:15:33] (03PS2) 10Urbanecm: GrowthExperiments: amend wrong wiki prefix for jbowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920784 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:15:33] ack [20:15:36] (03CR) 10Urbanecm: GrowthExperiments: amend wrong wiki prefix for jbowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920784 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:15:38] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: amend wrong wiki prefix for jbowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920784 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:15:40] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:50] (03CR) 10Ottomata: mediawiki-page-content-change-enrichment: enable HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) 
[20:16:26] (03Merged) 10jenkins-bot: GrowthExperiments: amend wrong wiki prefix for jbowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920784 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:17:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920784 (https://phabricator.wikimedia.org/T308134) (owner: 10Sergio Gimeno) [20:17:29] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:920784|GrowthExperiments: amend wrong wiki prefix for jbowiki (T308134)]], [[gerrit:920732|NewTopicOptOutActiveUsers: Skip bot users etc. (T317375)]], [[gerrit:920386|Enable zebra ab test in hewiki (T335972)]] [20:17:37] T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375 [20:17:38] T335972: Launch content separation (Zebra #9) A/B test - https://phabricator.wikimedia.org/T335972 [20:19:02] anyone here who has been using reimage cookbook a lot lately? [20:19:03] !log urbanecm@deploy1002 urbanecm and matmarex and ksarabia and sgimeno: Backport for [[gerrit:920784|GrowthExperiments: amend wrong wiki prefix for jbowiki (T308134)]], [[gerrit:920732|NewTopicOptOutActiveUsers: Skip bot users etc. (T317375)]], [[gerrit:920386|Enable zebra ab test in hewiki (T335972)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw. [20:19:04] wmnet [20:19:18] kimberly_sarabia: jan_drewniak: your patch is at mwdebug1002. can you test it please? [20:19:24] what I want is the numbers how long it took you [20:19:50] urbanecm: thank you. sure will do [20:22:09] seeing a lot of "resourceloader: Client and server registry version out of sync" in the logs. [20:22:33] that's normal after a deployment i think [20:22:59] i think so too, thanks for confirming. 
[20:23:22] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:37] urbanecm: looks good to sync [20:23:43] awesome, proceeding [20:24:27] (03CR) 10Hashar: zuul: switch to fixed uid/gid 923 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [20:27:37] (03CR) 10Dzahn: zuul: switch to fixed uid/gid 923 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [20:29:06] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:920784|GrowthExperiments: amend wrong wiki prefix for jbowiki (T308134)]], [[gerrit:920732|NewTopicOptOutActiveUsers: Skip bot users etc. (T317375)]], [[gerrit:920386|Enable zebra ab test in hewiki (T335972)]] (duration: 11m 36s) [20:29:13] T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375 [20:29:13] T308134: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 [20:29:13] T335972: Launch content separation (Zebra #9) A/B test - https://phabricator.wikimedia.org/T335972 [20:29:20] (03CR) 10Dzahn: zuul: switch to fixed uid/gid 923 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [20:29:23] kimberly_sarabia: your patch should be deployed [20:29:35] sergi0: and the fix for link recommendation's deployed too [20:29:40] now onto the script! [20:29:45] urbanecm: thanks! [20:29:54] urbanecm: nice, thank you! [20:30:42] (03CR) 10Dzahn: "yea, instructions make sense! I can just do this by myself anytime too. no problem." 
[puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [20:30:56] MatmaRex: https://phabricator.wikimedia.org/P48360 is the output [20:31:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:11] anything else? [20:31:34] urbanecm: looks great, thank you [20:31:37] no problem [20:31:58] (i'll need to schedule the "real" run with my team) [20:32:08] okay. so not to be done today :) [20:32:38] !log UTC late B&C window done [20:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:06] (03CR) 10Dzahn: "would you like me to just do this whenever I find the time or do you want to be present" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [20:38:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:30] (03CR) 10Dzahn: extdist: switch git URLs from gerrit-replica to gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [20:41:14] (03CR) 10Dzahn: "from me personally it's currently "soft -1"" [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [20:43:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:09] (03PS1) 10Ladsgroup: mwscript: Avoid prepending maintenance/ if >= 2 dots in argument [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/920788 (https://phabricator.wikimedia.org/T336819) [20:45:54] (03PS1) 10Andrew Bogott: cloud_private_subnet: allow hiera overrides of the nic used for the vlan [puppet] - 10https://gerrit.wikimedia.org/r/920789 [20:46:14] (03PS11) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [20:46:24] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:46] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [20:47:58] (03PS2) 10Andrew Bogott: cloud_private_subnet: allow hiera overrides of the nic used for the vlan [puppet] - 10https://gerrit.wikimedia.org/r/920789 [20:48:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:50:07] (03CR) 10Cathal Mooney: "small nits but overall looks good I think." 
[puppet] - 10https://gerrit.wikimedia.org/r/920789 (owner: 10Andrew Bogott) [20:50:16] (03CR) 10CI reject: [V: 04-1] cloud_private_subnet: allow hiera overrides of the nic used for the vlan [puppet] - 10https://gerrit.wikimedia.org/r/920789 (owner: 10Andrew Bogott) [20:51:10] 10SRE-swift-storage, 10Commons, 10Tracking-Neverending: Thumbnail/imagescaler (tracking) - https://phabricator.wikimedia.org/T43371 (10TheDJ) [20:51:58] (03CR) 10Cathal Mooney: cloud_private_subnet: allow hiera overrides of the nic used for the vlan (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920789 (owner: 10Andrew Bogott) [20:52:38] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:53:57] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [20:59:54] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs2012.codfw.wmnet [21:00:16] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:37] !log mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Public policy" "Global Advocacy" "Zabe" --reason "per request [[:phab:T333842|T333842]]" [21:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:41] T333842: Rename "Public policy" to "Global Advocacy" (and subpages) on meta-wiki - https://phabricator.wikimedia.org/T333842 [21:07:03] (03CR) 10Andrew Bogott: cloud_private_subnet: allow hiera overrides of the nic used for the vlan (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/920789 (owner: 10Andrew Bogott) [21:07:18] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1112.eqiad.wmnet - 
https://phabricator.wikimedia.org/T336332 (10wiki_willy) a:05wiki_willy→03Jclark-ctr [21:07:58] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:03] (03PS3) 10Andrew Bogott: cloud_private_subnet: allow hiera overrides of the nic used for the vlan [puppet] - 10https://gerrit.wikimedia.org/r/920789 [21:13:16] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/920789 (owner: 10Andrew Bogott) [21:13:20] 10SRE-swift-storage, 10Commons, 10Tracking-Neverending: Thumbnail/imagescaler (tracking) - https://phabricator.wikimedia.org/T43371 (10TheDJ) [21:13:30] (03CR) 10Andrew Bogott: [C: 03+2] cloud_private_subnet: allow hiera overrides of the nic used for the vlan [puppet] - 10https://gerrit.wikimedia.org/r/920789 (owner: 10Andrew Bogott) [21:15:42] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:19:56] PROBLEM - Host wdqs2012 is DOWN: PING CRITICAL - Packet loss = 100% [21:23:18] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:24:05] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Update extra plugin to 7.10.2-wmf8 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/912995 (https://phabricator.wikimedia.org/T332355) (owner: 10Ebernhardson) [21:26:40] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 12 days, 0:00:00 on wdqs2012.codfw.wmnet with reason: attempting WDQS stack on bullseye [21:26:54] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12 days, 0:00:00 on wdqs2012.codfw.wmnet with reason: attempting WDQS stack on bullseye [21:30:58] RECOVERY - Check systemd state on 
stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:37] (03PS1) 10Jdrewniak: Enable Veector AB test on beta spanish wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920793 (https://phabricator.wikimedia.org/T335972) [21:37:54] (03PS2) 10Jdrewniak: [Beta] Enable Vector AB test on beta spanish wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920793 (https://phabricator.wikimedia.org/T335972) [21:38:50] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:42] (03PS1) 10BCornwall: doh: Clearer expression of service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) [21:40:50] RECOVERY - Host wdqs2012 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [21:40:55] (03PS1) 10Andrew Bogott: cloudlb2001: use new cloud-private vlan addresses for designate [puppet] - 10https://gerrit.wikimedia.org/r/920795 (https://phabricator.wikimedia.org/T336808) [21:40:57] hey all, I know it's outside deploy windows, but could deploy a beta config change? 
(we're trying to get an AB test working :/ ) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/920793 [21:41:07] (03Abandoned) 10Andrew Bogott: Revert "cloudservices: codfw1dev: enable cloud-private subnet" [puppet] - 10https://gerrit.wikimedia.org/r/920734 (owner: 10Andrew Bogott) [21:41:37] (03CR) 10Andrew Bogott: [C: 03+2] cloudlb2001: use new cloud-private vlan addresses for designate [puppet] - 10https://gerrit.wikimedia.org/r/920795 (https://phabricator.wikimedia.org/T336808) (owner: 10Andrew Bogott) [21:42:02] (03CR) 10CI reject: [V: 04-1] doh: Clearer expression of service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [21:42:48] (03CR) 10Cathal Mooney: [C: 03+1] "Hostnames are resolving to the 172.20.5.x addressing so seems ok to me." [puppet] - 10https://gerrit.wikimedia.org/r/920795 (https://phabricator.wikimedia.org/T336808) (owner: 10Andrew Bogott) [21:44:47] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2012.codfw.wmnet [21:45:06] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:22] (03PS2) 10BCornwall: doh: Clearer expression of service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) [21:51:02] jan_drewniak: for some reason npm run-script build-all-portals does not generate any content on my machine, so I guess I'll let PortalsBuilder do it for me [21:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:52:15] hauskater: yeah sorry, that repo needs some major
maintenance (around the data fetching portion specifically) :/ [21:52:58] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:53:38] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41228/console" [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [21:58:50] (03PS2) 10Krinkle: Enable First Input Delay events. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918348 (https://phabricator.wikimedia.org/T332012) (owner: 10Phedenskog) [22:00:38] * Krinkle staging on mwdebug1002 [22:00:38] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:41] (03CR) 10Krinkle: [C: 03+2] Enable First Input Delay events. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918348 (https://phabricator.wikimedia.org/T332012) (owner: 10Phedenskog) [22:01:34] (03Merged) 10jenkins-bot: Enable First Input Delay events. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918348 (https://phabricator.wikimedia.org/T332012) (owner: 10Phedenskog) [22:08:16] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:14:13] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10cmooney) @aborrero based on our conversation on irc we can use this range for private VIPs in codfw: https:/... 
[22:15:03] !log krinkle@deploy1002 Synchronized wmf-config/: T332012 (duration: 06m 51s) [22:15:07] T332012: Collect first input delay - https://phabricator.wikimedia.org/T332012 [22:15:56] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:21:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:23:34] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:45] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [22:29:08] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove new openstack.codfw1dev.wikimediacloud.org name server A records. - cmooney@cumin1001" [22:30:03] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove new openstack.codfw1dev.wikimediacloud.org name server A records. 
- cmooney@cumin1001" [22:30:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:31:14] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:38:50] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:45:58] 10SRE, 10Inuka-Team, 10Wikipedia-Preview: Add both Wikipedia Preview repos to Packagist - https://phabricator.wikimedia.org/T310938 (10Varnent) @bd808 and @Krinkle - are you still the maintainers of the [[https://www.mediawiki.org/wiki/Manual:Developing_libraries#Packagist_guidelines | wikimedia account]]? [22:46:26] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:36] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:49:00] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:52:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:34] RECOVERY - Check systemd state on stat1009 is OK: OK 
- running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:20] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:44] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:16] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:54] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:08] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:04] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:54] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:58] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state