[00:38:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941061
[00:38:38] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941061 (owner: TrainBranchBot)
[00:40:36] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941062
[00:40:42] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941062 (owner: TrainBranchBot)
[00:55:21] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941061 (owner: TrainBranchBot)
[00:56:07] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941062 (owner: TrainBranchBot)
[01:07:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb1013.eqiad.wmnet with OS bullseye
[01:07:45] SRE, ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye
[02:00:20] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:18] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:19:16] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:18] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:16] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:47] (PS1) JHathaway: DO NOT MERGE: Remove hostname from ssh known_hosts aliases [puppet] - https://gerrit.wikimedia.org/r/941543
[02:41:51] (CR) JHathaway: "I asked on IRC as to why we include the hostname, but no one replied, does anyone know why? We could of course create another template or " [puppet] - https://gerrit.wikimedia.org/r/941543 (owner: JHathaway)
[02:50:06] (CR) Andrea Denisse: [C: +1] prometheus: add recording rules for cadvisor cpu/mem [puppet] - https://gerrit.wikimedia.org/r/940879 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[02:50:31] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!"
[puppet] - https://gerrit.wikimedia.org/r/940879 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[03:14:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:24:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:59:38] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:59:58] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:52] (PS1) Andrew Bogott: Horizon/docker: update image version [puppet] - https://gerrit.wikimedia.org/r/941549
[04:22:58] (Abandoned) Andrew Bogott: DO NOT MERGE, this is just a proof of concept [puppet] - https://gerrit.wikimedia.org/r/941454 (owner: Andrew Bogott)
[04:24:10] (CR) Andrew Bogott: [C: +2] Horizon/docker: update image version [puppet] - https://gerrit.wikimedia.org/r/941549 (owner: Andrew Bogott)
[04:27:18] (CR) Giuseppe Lavagetto: "Sorry, I didn't understand your question yesterday."
[puppet] - https://gerrit.wikimedia.org/r/941543 (owner: JHathaway)
[04:33:50] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: geoip_update_ipinfo.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:58] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:24] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:04:44] (CR) Giuseppe Lavagetto: [C: +2] kubernetes: add mw-misc "service" [puppet] - https://gerrit.wikimedia.org/r/940186 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[05:09:37] (PS5) Giuseppe Lavagetto: admin: add mw-misc namespace [deployment-charts] - https://gerrit.wikimedia.org/r/940198 (https://phabricator.wikimedia.org/T341859)
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T0600)
[06:02:11] (CR) Giuseppe Lavagetto: [C: +2] admin: add mw-misc namespace [deployment-charts] - https://gerrit.wikimedia.org/r/940198 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[06:04:25] (Merged) jenkins-bot: admin: add mw-misc namespace [deployment-charts] - https://gerrit.wikimedia.org/r/940198 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[06:15:44] !log oblivian@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[06:17:28] !log oblivian@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[06:17:48] !log oblivian@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[06:18:26] !log oblivian@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[06:18:37] (PS1) Andrea Denisse: xhgui: Remove xhgui1001 and xhgui1002 node definitions [puppet] - https://gerrit.wikimedia.org/r/941550 (https://phabricator.wikimedia.org/T342724)
[06:21:12] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts xhgui2001.codfw.wmnet,xhgui1001.eqiad.wmnet
[06:24:08] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[06:25:44] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[06:26:38] !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[06:30:19] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[06:31:06] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[06:33:10] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: xhgui2001.codfw.wmnet,xhgui1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[06:34:07] !log Stop mariadb on clouddb1021 T334651
[06:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:11] T334651: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651
[06:34:22] (PS2) Giuseppe Lavagetto: Add mw-misc service under ingress [dns] - https://gerrit.wikimedia.org/r/941403 (https://phabricator.wikimedia.org/T341859)
[06:34:57] (PS1) Marostegui: clouddb1021: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/941551 (https://phabricator.wikimedia.org/T334651)
[06:35:59] (CR) Giuseppe Lavagetto: [C: +2] Add mw-misc service under ingress [dns] - https://gerrit.wikimedia.org/r/941403 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[06:37:47] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: xhgui2001.codfw.wmnet,xhgui1001.eqiad.wmnet
decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[06:37:47] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:37:48] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts xhgui2001.codfw.wmnet,xhgui1001.eqiad.wmnet
[06:42:36] (CR) Marostegui: [C: +2] clouddb1021: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/941551 (https://phabricator.wikimedia.org/T334651) (owner: Marostegui)
[06:47:32] !log oblivian@cumin1001 START - Cookbook sre.dns.netbox
[06:48:51] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:49:38] (CR) Giuseppe Lavagetto: [C: +2] service::catalog: add mw-misc [puppet] - https://gerrit.wikimedia.org/r/941429 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[06:51:30] (CR) Giuseppe Lavagetto: mw-misc: add deployment with support for noc.wikimedia.org (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/940199 (owner: Giuseppe Lavagetto)
[06:51:35] (PS5) Giuseppe Lavagetto: mw-misc: add deployment with support for noc.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/940199
[07:00:04] Amir1, Urbanecm, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:01:40] (CR) Filippo Giunchedi: [C: +1] xhgui: Remove xhgui1001 and xhgui1002 node definitions [puppet] - https://gerrit.wikimedia.org/r/941550 (https://phabricator.wikimedia.org/T342724) (owner: Andrea Denisse)
[07:01:42] (CR) Filippo Giunchedi: [C: +2] prometheus: add recording rules for cadvisor cpu/mem [puppet] - https://gerrit.wikimedia.org/r/940879 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[07:03:39] (CR) Filippo Giunchedi: [C: +1] role::kafka::logging: apply threads settings to brokers [puppet] - https://gerrit.wikimedia.org/r/941455 (owner: Elukey)
[07:04:21] (CR) Giuseppe Lavagetto: [C: +2] mw-misc: add deployment with support for noc.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/940199 (owner: Giuseppe Lavagetto)
[07:06:45] (Merged) jenkins-bot: mw-misc: add deployment with support for noc.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/940199 (owner: Giuseppe Lavagetto)
[07:07:57] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[07:08:20] (PS2) Volans: sre.hosts.decommission: fix call to downtime [cookbooks] - https://gerrit.wikimedia.org/r/937508
[07:08:24] (PS2) Volans: sre.hosts.decommission: downtime mgmt only in AM [cookbooks] - https://gerrit.wikimedia.org/r/937509
[07:08:26] (PS1) Volans: sre.dns.netbox: use cumin alias for Netbox hosts [cookbooks] - https://gerrit.wikimedia.org/r/941757
[07:08:28] (PS1) Volans: Use only active authdns hosts for DNS changes [cookbooks] - https://gerrit.wikimedia.org/r/941758
[07:08:30] (PS1) Volans: sre.hosts.decommission: skip site.pp for matches [cookbooks] - https://gerrit.wikimedia.org/r/941759 (https://phabricator.wikimedia.org/T297516)
[07:08:32] (PS1) Volans: sre.hosts.decommission: search in the DNS repo too [cookbooks] - https://gerrit.wikimedia.org/r/941760
[07:08:42] (PS1) Alexandros Kosiaris: Rebuild for
T340087, aka wikidiff2 1.14.1 deployment [docker-images/production-images] - https://gerrit.wikimedia.org/r/941761 (https://phabricator.wikimedia.org/T340087)
[07:10:08] (CR) Volans: Use only active authdns hosts for DNS changes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/941758 (owner: Volans)
[07:14:11] (CR) Ayounsi: [C: +1] sre.dns.netbox: use cumin alias for Netbox hosts [cookbooks] - https://gerrit.wikimedia.org/r/941757 (owner: Volans)
[07:17:03] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (releases1002, ...), Fresh: 130 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:18:06] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[07:20:17] (PS1) Giuseppe Lavagetto: mediawiki::releases: add mw-misc [puppet] - https://gerrit.wikimedia.org/r/941772
[07:20:25] <_joe_> jouncebot: now
[07:20:25] For the next 0 hour(s) and 39 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T0700)
[07:20:39] <_joe_> oh no patches, great
[07:21:41] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::releases: add mw-misc [puppet] - https://gerrit.wikimedia.org/r/941772 (owner: Giuseppe Lavagetto)
[07:27:14] (PS1) Giuseppe Lavagetto: mediawiki-on-k8s: properly assign debug image to mw-misc [puppet] - https://gerrit.wikimedia.org/r/941773
[07:27:29] (CR) Giuseppe Lavagetto: [V: +2 C: +2] mediawiki-on-k8s: properly assign debug image to mw-misc [puppet] - https://gerrit.wikimedia.org/r/941773 (owner: Giuseppe Lavagetto)
[07:29:03] (CR) Alexandros Kosiaris: [C: +2] service::catalog: Add wikifunctions service [puppet] - https://gerrit.wikimedia.org/r/941313 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:29:37] (PS1) Giuseppe Lavagetto: mw-on-k8s: also define the web_flavour for mw-misc [puppet] - https://gerrit.wikimedia.org/r/941774
[07:30:13] (CR) Jelto: [C: +1] "lgtm, see previous comment" [puppet] - https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: Slyngshede)
[07:30:16] (CR) Giuseppe Lavagetto: [C: +2] mw-on-k8s: also define the web_flavour for mw-misc [puppet] - https://gerrit.wikimedia.org/r/941774 (owner: Giuseppe Lavagetto)
[07:30:31] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[07:35:43] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[07:35:43] (PS3) Alexandros Kosiaris: wmnet: Add cnames for wikifunctions ingress [dns] - https://gerrit.wikimedia.org/r/941312 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:35:56] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[07:36:46] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[07:36:50] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bullseye
[07:37:03] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[07:37:17] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[07:42:16] (PS1) Alexandros Kosiaris: service::catalog: Switch state to production [puppet] - https://gerrit.wikimedia.org/r/941775 (https://phabricator.wikimedia.org/T297314)
[07:42:46] (CR) Alexandros Kosiaris: [C: +2] "I 've added the discovery record as well, LGTM, merging" [dns] - https://gerrit.wikimedia.org/r/941312 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:46:03] SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (akosiaris)
[07:46:31] !log volans@cumin1001 END
(ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[07:47:35] (CR) Giuseppe Lavagetto: [C: +1] Rebuild for T340087, aka wikidiff2 1.14.1 deployment [docker-images/production-images] - https://gerrit.wikimedia.org/r/941761 (https://phabricator.wikimedia.org/T340087) (owner: Alexandros Kosiaris)
[07:48:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:49:03] !log updating bookworm netboot image for point release 12.1 ( https://wikitech.wikimedia.org/wiki/Updating_netboot_image_with_newer_kernel#Updating_production_point_release )
[07:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:33] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[07:54:22] (CR) Elukey: [V: +1 C: +2] role::kafka::logging: apply threads settings to brokers [puppet] - https://gerrit.wikimedia.org/r/941455 (owner: Elukey)
[07:56:40] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[07:58:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:58:54] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host
sretest1002.eqiad.wmnet with OS bookworm
[08:00:04] jnuche and dancy: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T0800).
[08:00:23] morning, I'll roll out the train to group1 in ~5m
[08:01:44] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons.
[08:02:02] (CR) Alexandros Kosiaris: [V: +2 C: +2] Rebuild for T340087, aka wikidiff2 1.14.1 deployment [docker-images/production-images] - https://gerrit.wikimedia.org/r/941761 (https://phabricator.wikimedia.org/T340087) (owner: Alexandros Kosiaris)
[08:05:56] (PS1) TrainBranchBot: group1 wikis to 1.41.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/941779 (https://phabricator.wikimedia.org/T340247)
[08:05:58] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.41.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/941779 (https://phabricator.wikimedia.org/T340247) (owner: TrainBranchBot)
[08:06:40] (Merged) jenkins-bot: group1 wikis to 1.41.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/941779 (https://phabricator.wikimedia.org/T340247) (owner: TrainBranchBot)
[08:11:02] (PS1) Elukey: changeprop: bump node-rdkafka, use buster base (prod version) [deployment-charts] - https://gerrit.wikimedia.org/r/941780 (https://phabricator.wikimedia.org/T341140)
[08:11:43] SRE-tools, Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (Volans) I can't reproduce this with bullseye, the reimage works fine with it. I tried to reproduce it with bookworm on `sretest1002` but I got an unrelated error...
[08:13:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:13:50] SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (akosiaris) I 've gone ahead and created https:/...
[08:14:19] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.19 refs T340247
[08:14:23] T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247
[08:18:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:23:46] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:24:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:34:15] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.19 refs T340247 (duration: 19m 56s)
[08:34:19] T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247
[08:34:34] (KubernetesAPILatency) firing: High Kubernetes API
latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:36:53] SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (akosiaris) I 've gone ahead and populated the S...
[08:36:57] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm
[08:39:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:39:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:43:01] SRE, serviceops: Request to block ActionApi client (based on a specific user agent header) - https://phabricator.wikimedia.org/T243858 (akosiaris) Open→Declined I am gonna close this as declined. While we do have the ability to block requests based on user-agent, we don't do that on request.
[08:49:38] (CR) JMeybohm: wmnet: Add cnames for wikifunctions ingress (1 comment) [dns] - https://gerrit.wikimedia.org/r/941312 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[08:51:22] SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF) >>! In T297314#9043540, @akosi...
[08:51:30] (PS4) Hashar: Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/935991
[08:52:20] (CR) Hashar: Recognize ~/.config/docker-pkg.yaml (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/935991 (owner: Hashar)
[08:55:00] (PS1) Filippo Giunchedi: prometheus: tune cadvisor rules [puppet] - https://gerrit.wikimedia.org/r/941839 (https://phabricator.wikimedia.org/T108027)
[08:55:20] (CR) Elukey: [C: +2] httpbb: update ml-services tests [puppet] - https://gerrit.wikimedia.org/r/941449 (https://phabricator.wikimedia.org/T342266) (owner: Ilias Sarantopoulos)
[08:56:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:58:42] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:28] (PS1) Elukey: role::kafka::jumbo: apply thread settings [puppet] - https://gerrit.wikimedia.org/r/941840
[09:00:44] (CR) Filippo Giunchedi: [C: +2] prometheus: tune cadvisor rules [puppet] -
https://gerrit.wikimedia.org/r/941839 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[09:03:26] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42705/console" [puppet] - https://gerrit.wikimedia.org/r/941840 (owner: Elukey)
[09:04:36] (CR) Hashar: [C: -1] "Bullseye no more provides `python-pip` :-\" [docker-images/production-images] - https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[09:06:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:13:23] SRE, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (fgiunchedi) Preliminary dashboard to explore per-unit resource usage: https://grafana.wikim...
[09:13:26] (CR) DCausse: flink-zk: Initiate new flink::zookeeper role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[09:13:31] (CR) Hnowlan: [C: +1] changeprop: bump node-rdkafka, use buster base (prod version) [deployment-charts] - https://gerrit.wikimedia.org/r/941780 (https://phabricator.wikimedia.org/T341140) (owner: Elukey)
[09:14:59] (PS1) Jforrester: Localisation updates from https://translatewiki.net.
[extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941842
[09:15:05] (PS1) Jforrester: docs: Move Vue top-level comment out of the rendered DOM [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941843
[09:15:11] (PS1) Jforrester: Set label changes to true when updating Function Description [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941844 (https://phabricator.wikimedia.org/T342596)
[09:15:17] (PS1) Jforrester: docs: Ensure we have a proper file-level block on every code file [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941845
[09:15:23] (PS1) Jforrester: Don't hard-code parantheses, as they differ by language [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941846
[09:15:29] (PS1) Jforrester: ExpandedToggle: Rotate icon the other way on dir=rtl context [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941847 (https://phabricator.wikimedia.org/T337988)
[09:15:35] (PS1) Jforrester: Add exit cancel handling for browser events [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941848 (https://phabricator.wikimedia.org/T340627)
[09:15:41] (PS1) Jforrester: About widget: Tie disabled state of edit pencil button to canEditObject method [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941849 (https://phabricator.wikimedia.org/T329982)
[09:15:47] (PS1) Jforrester: Handle oldid url param to view a particular revision [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941850 (https://phabricator.wikimedia.org/T287514)
[09:15:53] (PS1) Jforrester: AUTHORS: Update for July 2023 [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941851
[09:15:59] (PS1) Jforrester: Update function-schemata sub-module to HEAD (1c01f22) [extensions/WikiLambda]
(wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941852 (https://phabricator.wikimedia.org/T335583)
[09:16:05] (PS1) Jforrester: PageRenderingHandler: Don't make 'read' selected if we're on the edit tab [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941853
[09:16:11] (PS1) Jforrester: Localisation updates from https://translatewiki.net. [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941854
[09:18:55] (PS1) Filippo Giunchedi: prometheus: add rss/swap/cache memory aggregates [puppet] - https://gerrit.wikimedia.org/r/941855 (https://phabricator.wikimedia.org/T108027)
[09:19:09] (CR) CI reject: [V: -1] prometheus: add rss/swap/cache memory aggregates [puppet] - https://gerrit.wikimedia.org/r/941855 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[09:19:50] (PS2) Filippo Giunchedi: prometheus: add rss/swap/cache memory aggregates [puppet] - https://gerrit.wikimedia.org/r/941855 (https://phabricator.wikimedia.org/T108027)
[09:26:08] (PS1) Jforrester: services_proxy: Add wikifunctions service [puppet] - https://gerrit.wikimedia.org/r/941856 (https://phabricator.wikimedia.org/T297314)
[09:27:19] train blocker, doesn't seem to affect users so I'm not rolling back for now: https://phabricator.wikimedia.org/T342733
[09:28:09] (PS2) Jforrester: service::catalog: Switch wikifunctions to state production [puppet] - https://gerrit.wikimedia.org/r/941314 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[09:29:41] (CR) AW GitLab Bot: "PAN-PAN: end-to-end deploy stage failed" [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941852 (https://phabricator.wikimedia.org/T335583) (owner: Jforrester)
[09:29:59] (CR) Jforrester: "Dupe of Ia7d375d8d9e643e390e931fd47e1a3502f7d53d6." [puppet] - https://gerrit.wikimedia.org/r/941775 (https://phabricator.wikimedia.org/T297314) (owner: Alexandros Kosiaris)
[09:30:15] Krinkle, duesen: ^^ blocker you may know something about, in case you're around
[09:34:47] (CR) Filippo Giunchedi: [C: +2] prometheus: add rss/swap/cache memory aggregates [puppet] - https://gerrit.wikimedia.org/r/941855 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[09:42:39] (CR) AW GitLab Bot: "PAN-PAN: end-to-end deploy stage failed" [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941850 (https://phabricator.wikimedia.org/T287514) (owner: Jforrester)
[09:52:32] (CR) Kamila Součková: [C: +1] changeprop: bump node-rdkafka, use buster base (prod version) [deployment-charts] - https://gerrit.wikimedia.org/r/941780 (https://phabricator.wikimedia.org/T341140) (owner: Elukey)
[09:58:36] (GitLabCIJobErrors) firing: GitLab - High CI job error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIJobErrors
[09:59:40] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons.
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1000) [10:00:30] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:28] (03CR) 10AW GitLab Bot: "PAN-PAN: end-to-end deploy stage failed" [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941846 (owner: 10Jforrester) [10:03:36] (GitLabCIJobErrors) resolved: GitLab - High CI job error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIJobErrors [10:15:22] (03CR) 10AW GitLab Bot: "PAN-PAN: end-to-end deploy stage failed" [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941842 (owner: 10Jforrester) [10:17:10] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bookworm [10:21:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] services_proxy: Add wikifunctions service [puppet] - 10https://gerrit.wikimedia.org/r/941856 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [10:22:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [10:22:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [10:22:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T342617)', diff saved to https://phabricator.wikimedia.org/P49708 and previous config saved to /var/cache/conftool/dbconfig/20230726-102232-ladsgroup.json [10:22:36] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:24:17] 10SRE-tools, 10Infrastructure-Foundations: 
sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10Fabfur) The same error goes on lvs1016 now, but didn't on previous hosts. Confirm that we're not running custom kernel (AFAIK)... [10:25:40] jnuche: ad https://phabricator.wikimedia.org/T342733#9043895, i think that bug is worth rolling back. [10:27:36] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1016.eqiad.wmnet with OS bookworm [10:31:10] !log eoghan@cumin1001 START - Cookbook sre.hosts.decommission for hosts releases2002.codfw.wmnet [10:31:20] urbanecm: agreed, I'll roll back as soon as the MW infra window is over [10:34:34] (03PS1) 10Jforrester: wikifunctions: Configure service_proxy port [deployment-charts] - 10https://gerrit.wikimedia.org/r/941865 (https://phabricator.wikimedia.org/T297314) [10:44:08] (03CR) 10Alexandros Kosiaris: wikifunctions: Configure service_proxy port (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/941865 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [10:45:28] (03CR) 10Alexandros Kosiaris: [C: 04-1] wikifunctions: Configure service_proxy port [deployment-charts] - 10https://gerrit.wikimedia.org/r/941865 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [10:45:52] jnuche: might be https://gerrit.wikimedia.org/r/c/mediawiki/core/+/935921 [10:46:11] I'm afk for a while now. 
Krinkle and TimStarling know more [10:47:13] duesen: will wait for them then, thanks for looking [10:47:20] it seems to be either that one you mention or https://gerrit.wikimedia.org/r/c/mediawiki/core/+/941030 [10:50:14] (03PS2) 10Alexandros Kosiaris: service::catalog: Switch state to production [puppet] - 10https://gerrit.wikimedia.org/r/941775 (https://phabricator.wikimedia.org/T297314) [10:50:16] (03PS1) 10Alexandros Kosiaris: wikifunctions: Add to enabled_listeners, fix ingress spec [puppet] - 10https://gerrit.wikimedia.org/r/941888 [10:53:39] !log eoghan@cumin1001 START - Cookbook sre.dns.netbox [10:55:42] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: releases2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1001" [10:56:21] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bullseye [10:56:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifunctions: Add to enabled_listeners, fix ingress spec [puppet] - 10https://gerrit.wikimedia.org/r/941888 (owner: 10Alexandros Kosiaris) [10:56:48] (03PS2) 10Alexandros Kosiaris: wikifunctions: Add to enabled_listeners, fix ingress spec [puppet] - 10https://gerrit.wikimedia.org/r/941888 [10:56:51] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: releases2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1001" [10:56:51] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:56:52] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts releases2002.codfw.wmnet [10:57:18] !log eoghan@cumin1001 START - Cookbook sre.hosts.decommission for hosts releases1002.eqiad.wmnet [11:00:46] I'll roll back in 10 mins [11:00:58] (the train to group0) [11:01:25] !log 
eoghan@cumin1001 START - Cookbook sre.dns.netbox [11:03:59] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: releases1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1001" [11:05:14] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: releases1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1001" [11:05:14] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:05:15] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts releases1002.eqiad.wmnet [11:09:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T342617)', diff saved to https://phabricator.wikimedia.org/P49710 and previous config saved to /var/cache/conftool/dbconfig/20230726-110948-ladsgroup.json [11:09:52] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:11:19] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941898 (https://phabricator.wikimedia.org/T340247) [11:11:22] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941898 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [11:12:08] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941898 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [11:14:52] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1016.eqiad.wmnet with OS bullseye [11:18:29] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 130 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:19:04] !log 
jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.19 refs T340247 [11:19:08] T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247 [11:19:55] (03CR) 10Klausman: DO NOT MERGE: Remove hostname from ssh known_hosts aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941543 (owner: 10JHathaway) [11:23:44] ty for the rollback jnuche; Commons is back up for my test acc. [11:24:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P49711 and previous config saved to /var/cache/conftool/dbconfig/20230726-112454-ladsgroup.json [11:25:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:27:45] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bullseye [11:28:36] urbanecm: np :) [11:29:16] and i noticed a new train blocker... filing now :). [11:29:54] ah damn [11:32:31] 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Clement_Goubert) We've been experiencing throttling on mw-api-int and raising the container's CPU limit has helped, but not fixe... 
[11:32:41] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1016.eqiad.wmnet with OS bullseye [11:32:43] filed as T342747 [11:32:44] T342747: Special:UserRights in interwiki mode: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to the local wiki, but it belongs to 'metawiki' - https://phabricator.wikimedia.org/T342747 [11:33:32] thx [11:34:54] (03PS1) 10Clément Goubert: mw-api-int: Raise number of replicas to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/941900 (https://phabricator.wikimedia.org/T342252) [11:40:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P49712 and previous config saved to /var/cache/conftool/dbconfig/20230726-114001-ladsgroup.json [11:40:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons. [11:42:41] (03PS2) 10Jforrester: wikifunctions: Configure service_proxy port [deployment-charts] - 10https://gerrit.wikimedia.org/r/941865 (https://phabricator.wikimedia.org/T297314) [11:42:43] (03CR) 10Jforrester: wikifunctions: Configure service_proxy port (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/941865 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [11:43:27] (03CR) 10Jforrester: [C: 03+2] WIP helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [11:43:43] (03CR) 10Jforrester: "Argh, sorry, mis-click." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [11:43:58] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Configure service_proxy port [deployment-charts] - 10https://gerrit.wikimedia.org/r/941865 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [11:44:31] (03Merged) 10jenkins-bot: wikifunctions: Configure service_proxy port [deployment-charts] - 10https://gerrit.wikimedia.org/r/941865 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [11:45:53] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:46:30] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:47:25] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [11:48:38] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [11:48:49] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [11:50:09] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [11:51:01] (03PS1) 10Hnowlan: images: enable "debug" on memcache, log when servers are dead [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/941901 (https://phabricator.wikimedia.org/T341805) [11:51:27] 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10RobH) [11:51:42] 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10RobH) [11:55:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T342617)', diff saved to https://phabricator.wikimedia.org/P49713 and previous config saved to /var/cache/conftool/dbconfig/20230726-115507-ladsgroup.json [11:55:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db2124.codfw.wmnet with reason: Maintenance [11:55:14] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:55:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [11:55:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T342617)', diff saved to https://phabricator.wikimedia.org/P49714 and previous config saved to /var/cache/conftool/dbconfig/20230726-115528-ladsgroup.json [11:56:21] (03CR) 10CI reject: [V: 04-1] images: enable "debug" on memcache, log when servers are dead [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/941901 (https://phabricator.wikimedia.org/T341805) (owner: 10Hnowlan) [11:56:26] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10aborrero) hey @Jhancock.wm they need to be connected to a cloudsw device. The only such device we have in codfw is in rack `B1`... 
[11:57:24] (03PS11) 10Jforrester: Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) [11:57:26] (03PS15) 10Jforrester: Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [11:57:28] (03PS9) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 [11:57:30] (03PS7) 10Jforrester: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632 [11:57:32] (03PS13) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [11:57:34] (03PS2) 10Jforrester: [DNM] Move wikifunctions.org from locked-down to limited deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941515 [11:57:36] (03PS1) 10Jforrester: ProductionServices: Define the wikifunctions orchestrator access point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941902 (https://phabricator.wikimedia.org/T297314) [11:57:51] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10aborrero) [11:58:42] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10aborrero) [12:00:13] (03PS1) 10Clément Goubert: team-sre: Raise KubeletOperationalLatency threshold for list_images [alerts] - 10https://gerrit.wikimedia.org/r/941903 (https://phabricator.wikimedia.org/T342250) [12:00:23] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - 
https://phabricator.wikimedia.org/T342456 (10aborrero) [12:01:43] (03PS4) 10Hashar: python-build: provide a python2 Bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) [12:03:02] (03CR) 10Hashar: "I went to install pip from the source tarball (with a sha256sum validation). The resulting image builds locally and I have managed to use " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [12:17:33] (03PS2) 10JMeybohm: KubeletOperationalLatency: Raise threshold for list_images operations [alerts] - 10https://gerrit.wikimedia.org/r/941903 (https://phabricator.wikimedia.org/T342250) (owner: 10Clément Goubert) [12:19:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:57] (03PS3) 10JMeybohm: KubeletOperationalLatency: Raise threshold for list_images operations [alerts] - 10https://gerrit.wikimedia.org/r/941903 (https://phabricator.wikimedia.org/T342250) (owner: 10Clément Goubert) [12:23:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10Manuel) > analytics-wmde-users would be a better choice Thank you @BTullis, your suggestion makes sense to me! What do you think, @karapayneWMDE? [12:26:06] jouncebot: next [12:26:06] In 0 hour(s) and 33 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1300) [12:26:20] OK, I'll sling my no-op config out now. 
[12:27:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941902 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [12:27:45] (03Merged) 10jenkins-bot: ProductionServices: Define the wikifunctions orchestrator access point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941902 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [12:28:01] (03CR) 10JMeybohm: [C: 03+2] KubeletOperationalLatency: Raise threshold for list_images operations [alerts] - 10https://gerrit.wikimedia.org/r/941903 (https://phabricator.wikimedia.org/T342250) (owner: 10Clément Goubert) [12:28:30] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:941902|ProductionServices: Define the wikifunctions orchestrator access point (T297314)]] [12:28:35] T297314: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 [12:29:08] (03Merged) 10jenkins-bot: KubeletOperationalLatency: Raise threshold for list_images operations [alerts] - 10https://gerrit.wikimedia.org/r/941903 (https://phabricator.wikimedia.org/T342250) (owner: 10Clément Goubert) [12:30:01] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:941902|ProductionServices: Define the wikifunctions orchestrator access point (T297314)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [12:30:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:32] (03CR) 10Jforrester: [C: 03+2] Localisation updates from https://translatewiki.net. 
[extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941854 (owner: 10Jforrester) [12:35:38] (03CR) 10Jforrester: [C: 03+2] PageRenderingHandler: Don't make 'read' selected if we're on the edit tab [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941853 (owner: 10Jforrester) [12:35:44] (03CR) 10Jforrester: [C: 03+2] Update function-schemata sub-module to HEAD (1c01f22) [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941852 (https://phabricator.wikimedia.org/T335583) (owner: 10Jforrester) [12:35:50] (03CR) 10Jforrester: [C: 03+2] AUTHORS: Update for July 2023 [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941851 (owner: 10Jforrester) [12:35:56] (03CR) 10Jforrester: [C: 03+2] Handle oldid url param to view a particular revision [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941850 (https://phabricator.wikimedia.org/T287514) (owner: 10Jforrester) [12:36:02] (03CR) 10Jforrester: [C: 03+2] About widget: Tie disabled state of edit pencil button to canEditObject method [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941849 (https://phabricator.wikimedia.org/T329982) (owner: 10Jforrester) [12:36:10] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:941902|ProductionServices: Define the wikifunctions orchestrator access point (T297314)]] (duration: 07m 39s) [12:36:14] (03CR) 10Jforrester: [C: 03+2] Add exit cancel handling for browser events [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941848 (https://phabricator.wikimedia.org/T340627) (owner: 10Jforrester) [12:36:14] T297314: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 [12:36:20] (03CR) 10Jforrester: [C: 03+2] ExpandedToggle: Rotate icon the other way on dir=rtl context [extensions/WikiLambda] 
(wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941847 (https://phabricator.wikimedia.org/T337988) (owner: 10Jforrester) [12:36:26] (03CR) 10Jforrester: [C: 03+2] Don't hard-code parantheses, as they differ by language [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941846 (owner: 10Jforrester) [12:36:32] (03CR) 10Jforrester: [C: 03+2] docs: Ensure we have a proper file-level block on every code file [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941845 (owner: 10Jforrester) [12:36:38] (03CR) 10Jforrester: [C: 03+2] Set label changes to true when updating Function Description [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941844 (https://phabricator.wikimedia.org/T342596) (owner: 10Jforrester) [12:36:44] (03CR) 10Jforrester: [C: 03+2] docs: Move Vue top-level comment out of the rendered DOM [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941843 (owner: 10Jforrester) [12:36:50] (03CR) 10Jforrester: [C: 03+2] Localisation updates from https://translatewiki.net. [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941842 (owner: 10Jforrester) [12:39:45] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 13335 [12:40:27] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. 
[extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941842 (owner: 10Jforrester) [12:40:33] (03Merged) 10jenkins-bot: docs: Move Vue top-level comment out of the rendered DOM [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941843 (owner: 10Jforrester) [12:40:48] (03Merged) 10jenkins-bot: Set label changes to true when updating Function Description [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941844 (https://phabricator.wikimedia.org/T342596) (owner: 10Jforrester) [12:40:51] (03Merged) 10jenkins-bot: docs: Ensure we have a proper file-level block on every code file [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941845 (owner: 10Jforrester) [12:40:57] (03Merged) 10jenkins-bot: Don't hard-code parantheses, as they differ by language [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941846 (owner: 10Jforrester) [12:41:03] (03Merged) 10jenkins-bot: ExpandedToggle: Rotate icon the other way on dir=rtl context [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941847 (https://phabricator.wikimedia.org/T337988) (owner: 10Jforrester) [12:41:09] (03Merged) 10jenkins-bot: Add exit cancel handling for browser events [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941848 (https://phabricator.wikimedia.org/T340627) (owner: 10Jforrester) [12:41:15] (03Merged) 10jenkins-bot: About widget: Tie disabled state of edit pencil button to canEditObject method [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941849 (https://phabricator.wikimedia.org/T329982) (owner: 10Jforrester) [12:41:21] (03Merged) 10jenkins-bot: Handle oldid url param to view a particular revision [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941850 (https://phabricator.wikimedia.org/T287514) (owner: 10Jforrester) [12:41:27] (03Merged) 
10jenkins-bot: AUTHORS: Update for July 2023 [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941851 (owner: 10Jforrester) [12:41:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13335 [12:41:33] (03Merged) 10jenkins-bot: Update function-schemata sub-module to HEAD (1c01f22) [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941852 (https://phabricator.wikimedia.org/T335583) (owner: 10Jforrester) [12:42:20] (03Merged) 10jenkins-bot: PageRenderingHandler: Don't make 'read' selected if we're on the edit tab [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941853 (owner: 10Jforrester) [12:44:02] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941854 (owner: 10Jforrester) [12:45:16] (03PS1) 10Andrew Bogott: Update horizon/docker version for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/941926 [12:45:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T342617)', diff saved to https://phabricator.wikimedia.org/P49716 and previous config saved to /var/cache/conftool/dbconfig/20230726-124545-ladsgroup.json [12:45:49] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:46:52] (03CR) 10Andrew Bogott: [C: 03+2] Update horizon/docker version for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/941926 (owner: 10Andrew Bogott) [12:47:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:52:55] !log jforrester@deploy1002 Synchronized php-1.41.0-wmf.19/extensions/WikiLambda/: Update WikiLambda wmf.19 branch to latest ahead of wikifunctions.org roll-out (duration: 07m 10s) [12:53:35] 10sre-alert-triage, 10serviceops: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (10fgiunchedi) [12:53:39] (03PS1) 10Sohom Datta: Add validator userright for pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) [12:54:10] 10sre-alert-triage, 10serviceops: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342756 (10fgiunchedi) [12:54:23] 10sre-alert-triage, 10Infrastructure-Foundations: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342754 (10fgiunchedi) [12:54:28] (03CR) 10Jforrester: Create puppet scripting for sqooping Wikifunctions tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) (owner: 10David Martin) [12:54:30] 10sre-alert-triage, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Alert triage - https://phabricator.wikimedia.org/T342250 (10JMeybohm) [12:54:45] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (10fgiunchedi) [12:55:46] 10sre-alert-triage, 10cloud-services-team: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342757 (10LSobanski) [12:56:55] 10sre-alert-triage, 
10Prod-Kubernetes, 10serviceops, 10Kubernetes: Alert triage: KubeletOperationalLatency - https://phabricator.wikimedia.org/T342250 (10JMeybohm) [12:57:21] 10sre-alert-triage, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Alert triage: KubeletOperationalLatency - https://phabricator.wikimedia.org/T342250 (10JMeybohm) 05Open→03Resolved I believe this is resolved now [12:57:41] 10sre-alert-triage, 10serviceops: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342758 (10LSobanski) [12:58:22] RECOVERY - Check systemd state on idm2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:25] 10sre-alert-triage, 10Infrastructure-Foundations: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342754 (10SLyngshede-WMF) a:03SLyngshede-WMF Leftovers from a failover test. ` slyngshede@idm2001:~$ sudo systemctl status rq-bitu.service ● rq-bitu.service Loaded: not-found (Re... [12:59:52] 10sre-alert-triage, 10Infrastructure-Foundations: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342754 (10SLyngshede-WMF) 05Open→03Resolved Alert has cleared. [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1300). [13:00:06] toni_ and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] toni_: I can deploy your backlog patch if you want? [13:00:14] \o [13:00:24] Oh, and Dreamy_Jazz, didn't refresh the page fast enough. 
:-) [13:00:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P49717 and previous config saved to /var/cache/conftool/dbconfig/20230726-130051-ladsgroup.json [13:00:58] sure, ready whenever you are [13:01:06] Dreamy_Jazz: You want me to cherry-pick the CU patch, land it, then manually make the tables? [13:01:08] toni_: Going now. [13:01:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941514 (https://phabricator.wikimedia.org/T341896) (owner: 10Tsevener) [13:01:30] The SQL files should exist on testwiki [13:01:58] Oh, right, it's already in the train, you just want me to do the mwscripts? [13:01:58] Cool. [13:01:58] * urbanecm also waves [13:02:04] Yup [13:02:05] Thanks [13:02:05] but sees James_F is on the window :) [13:02:06] (03Merged) 10jenkins-bot: Add stream config for iOS schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941514 (https://phabricator.wikimedia.org/T341896) (owner: 10Tsevener) [13:02:10] Just for testwiki though [13:02:24] As the tables should not exist anywhere else but s3 and s5. [13:02:26] urbanecm: Thought I'd get some practice in ahead of later today. ;-) [13:02:29] Dreamy_Jazz: Ack. [13:02:35] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:941514|Add stream config for iOS schema (T341896)]] [13:02:39] T341896: Instrument New Diff Screen and actions - https://phabricator.wikimedia.org/T341896 [13:03:18] good luck with wikifunctions :) [13:04:46] Dreamy_Jazz: So `mwscript sql.php --wiki=testwiki extensions/CheckUser/schema/mysql/cu_useragent_clienthints.sql` and `mwscript sql.php --wiki=testwiki extensions/CheckUser/schema/mysql/cu_useragent_clienthints_map.sql`? [13:05:02] Yes. That looks correct to me. [13:05:22] OK, done. [13:05:25] i'll log. 
[13:05:57] !log Created cu_useragent_clienthints.sql and cu_useragent_clienthints_map.sql on testwiki for T258105 [13:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:02] T258105: Implement storage for User-Agent Client Hints header data - https://phabricator.wikimedia.org/T258105 [13:06:34] Thanks! [13:06:57] And `DESCRIBE cu_useragent_clienthints;` returns a table not an error, so {{done}}. :-) [13:07:06] :) [13:08:07] * James_F twiddles thumbs waiting for scap. Sorry it's so slow, toni_! [13:08:16] urbanecm: Ha, thanks! [13:08:20] 10sre-alert-triage, 10serviceops: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342758 (10JMeybohm) 05Open→03Resolved a:03JMeybohm >>! [[ https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Envoy | Runbook ]] said: > If you see an error about runtime variables being... [13:08:50] 10sre-alert-triage, 10serviceops: Alert triage: EnvoyRuntimeAdminOverrides on restbase1027 - https://phabricator.wikimedia.org/T342758 (10JMeybohm) [13:08:56] no prob - will I be testing in mwdebug1002.eqiad.wmnet? [13:09:22] Yeah, or wherever; scap now slams them out on all debug servers at once. [13:09:32] But it's still in the k8s build step. [13:09:42] ah ok cool [13:09:49] 10sre-alert-triage, 10serviceops: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342761 (10LSobanski) [13:10:37] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MatthewVernon) I think the difficulty relates to the sheer number of thumbnails: ` x=0 for i in $(swift list --prefix wikipedia-c... 
[13:12:19] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (10LSobanski) [13:12:53] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (10LSobanski) [13:13:26] Hmm. No movement in 10 minutes. :-( [13:13:34] I'm going to kill and re-start scap. [13:13:35] !log jforrester@deploy1002 sync-world aborted: Backport for [[gerrit:941514|Add stream config for iOS schema (T341896)]] (duration: 11m 00s) [13:13:39] T341896: Instrument New Diff Screen and actions - https://phabricator.wikimedia.org/T341896 [13:14:05] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:941514|Add stream config for iOS schema (T341896)]] [13:15:25] James_F: Have you checked the build log? [13:15:34] 10sre-alert-triage, 10serviceops: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342756 (10JMeybohm) This looks like a fallout from {T341859} [13:15:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P49718 and previous config saved to /var/cache/conftool/dbconfig/20230726-131557-ladsgroup.json [13:16:43] claime: Sadly I already re-started so I guess the old log got removed. [13:16:53] ack [13:17:06] Sometimes it's just the upload being slow [13:17:13] 10sre-alert-triage, 10Release-Engineering-Team: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (10JMeybohm) something scap ` Jul 25 03:01:23 deploy1002 scap[30573]: At least one patch failed to apply Jul 25 03:01:23 deploy1002 scap[30573]: 03:01:23 stage-train failed: (It's running fine right now.) 
[13:17:40] (03PS2) 10Sohom Datta: Add validator userright for pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) [13:18:54] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (10BTullis) That's interesting. I don't know why the puppet run alerts are still unknown in icinga for these four analytics hosts. {F37150542} https://alerts.wikimedia.org/?q=alertname%3Dpuppet%20last%20run&q=team... [13:20:43] 10sre-alert-triage, 10cloud-services-team: Alert triage: severity adjustment - https://phabricator.wikimedia.org/T342764 (10LSobanski) [13:21:37] (03CR) 10Urbanecm: [C: 04-1] Add validator userright for pawikisource (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) (owner: 10Sohom Datta) [13:22:44] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (10BTullis) Tagging @bking and @RKemper in case they're unaware of this. [13:22:59] !log fab@deploy1002 Started deploy [airflow-dags/research@e7b9253]: (no justification provided) [13:23:07] !log fab@deploy1002 Finished deploy [airflow-dags/research@e7b9253]: (no justification provided) (duration: 00m 07s) [13:23:43] !log jforrester@deploy1002 jforrester and tsev: Backport for [[gerrit:941514|Add stream config for iOS schema (T341896)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:23:47] T341896: Instrument New Diff Screen and actions - https://phabricator.wikimedia.org/T341896 [13:24:28] toni_: OK! 
It's now live on debug 1002 (and 1001, 2002, 2001, and k8s) [13:24:38] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (10BTullis) A manual run on analytics1070 as the nagios user reports success. ` nagios@analytics1070:/home/btullis$ /usr/bin/sudo /usr/local/lib/nagios/plugins/check_puppetrun -w 10800 -c 21600 OK: Puppet is curre... [13:25:08] looks good! [13:25:17] Awesome, syncing now. [13:30:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:31:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T342617)', diff saved to https://phabricator.wikimedia.org/P49719 and previous config saved to /var/cache/conftool/dbconfig/20230726-133104-ladsgroup.json [13:31:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [13:31:08] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:31:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [13:34:21] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:941514|Add stream config for iOS schema (T341896)]] (duration: 20m 16s) [13:34:25] T341896: Instrument New Diff Screen and actions - https://phabricator.wikimedia.org/T341896 [13:35:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:30] OK, all done. [13:38:21] (03CR) 10Ssingh: [V: 03+2 C: 03+2] "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [13:42:57] (03PS1) 10Urbanecm: Revert "specials: Use cross-wiki aware UserIdentityLookup on Special:UserRights" [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941877 (https://phabricator.wikimedia.org/T255309) [13:43:21] jouncebot: nowandnext [13:43:22] For the next 0 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1300) [13:43:22] In 2 hour(s) and 16 minute(s): Wikifunctions.org creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1600) [13:43:40] James_F thank you! [13:44:11] (03CR) 10Urbanecm: [C: 03+2] Revert "specials: Use cross-wiki aware UserIdentityLookup on Special:UserRights" [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941877 (https://phabricator.wikimedia.org/T255309) (owner: 10Urbanecm) [13:44:46] * urbanecm steals the deployment floor for a train blocker [13:46:30] !log begin reboot of lvs4010 (T335835) [13:46:36] urbanecm: Go for it. [13:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:40] (03CR) 10Ssingh: [C: 03+1] "LGTM, one minor nit that you can ignore if you had like." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/941760 (owner: 10Volans) [13:46:44] waiting on CI now :) [13:48:38] (03CR) 10Filippo Giunchedi: [C: 03+2] Remove frmon1001 and frmon2001 from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/941480 (https://phabricator.wikimedia.org/T342693) (owner: 10Dwisehaupt) [13:50:01] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4010.ulsfo.wmnet [13:50:08] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:26] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:51:48] what a better time to have a persistent gate failure in core's selenium tests than right now :) [13:52:50] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4010.ulsfo.wmnet [13:52:58] PROBLEM - Host lvs4010 is DOWN: PING CRITICAL - Packet loss = 100% [13:53:06] RECOVERY - Host lvs4010 is UP: PING OK - Packet loss = 0%, RTA = 70.89 ms [13:53:34] PROBLEM - pybal on lvs4010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:55:02] RECOVERY - pybal on lvs4010 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:55:55] !log end reboot of lvs4010 (T335835) [13:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:13] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/941760 (owner: 10Volans) [13:59:53] (03Merged) 10jenkins-bot: Revert "specials: Use cross-wiki aware UserIdentityLookup on Special:UserRights" [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941877 (https://phabricator.wikimedia.org/T255309) (owner: 10Urbanecm) [14:00:07] at least wmf.19 merges 
work [14:00:07] (03CR) 10Ssingh: [C: 03+1] sre.hosts.decommission: search in the DNS repo too (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/941760 (owner: 10Volans) [14:00:21] !log begin reboot of lvs4008 (T335835) [14:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:27] (03PS2) 10Hnowlan: images: enable "debug" on memcache, log when servers are dead [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/941901 (https://phabricator.wikimedia.org/T341805) [14:00:51] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:941877|Revert "specials: Use cross-wiki aware UserIdentityLookup on Special:UserRights" (T255309 T342747)]] [14:00:56] T255309: Remove UserRightsProxy and replace its usages with UserGroupManager - https://phabricator.wikimedia.org/T255309 [14:00:56] T342747: Special:UserRights in interwiki mode: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to the local wiki, but it belongs to 'metawiki' - https://phabricator.wikimedia.org/T342747 [14:01:57] fabfur: i just saw your lvs log entries; i'm in the middle of a scap MW deployment (to fix a train blocker). i recall several recent outages related to concurrent lvs and mw deployments -- is that still going to be an issue?
[14:02:15] (i'll let scap wait on the mwdebug stage until this question is answered) [14:02:29] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:941877|Revert "specials: Use cross-wiki aware UserIdentityLookup on Special:UserRights" (T255309 T342747)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:04:23] urbanecm: don't know if that could be a problem but I'm working only in ulsfo if it matters [14:04:42] and the traffic should switch seamlessly on the passive loadbalancer [14:04:52] urbanecm: thanks for checking but we resolved that [14:05:02] so no issues during a deploy and LVS maintenance [14:05:14] (03PS6) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [14:05:16] sukhe: okay, that's great to hear. wouldn't want to depool random servers again :)) [14:05:23] in that case, i'll continue with the deployment. [14:05:26] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:05:32] urbanecm: haha yeah!
[14:05:33] please do [14:05:40] (03CR) 10CI reject: [V: 04-1] flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [14:06:28] PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:06:36] (03PS1) 10Jforrester: Normalize the skin name when it comes from preferences or useskin [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941879 (https://phabricator.wikimedia.org/T342733) [14:06:57] patch fixes the issue, proceeding. [14:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:58] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:09:15] (03Abandoned) 10Alexandros Kosiaris: service::catalog: Switch state to production [puppet] - 10https://gerrit.wikimedia.org/r/941775 (https://phabricator.wikimedia.org/T297314) (owner: 10Alexandros Kosiaris) [14:11:17] (03PS7) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [14:12:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [14:12:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [14:12:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T342617)', diff saved to https://phabricator.wikimedia.org/P49720 and previous config saved to 
/var/cache/conftool/dbconfig/20230726-141228-ladsgroup.json [14:12:32] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:13:10] (03CR) 10Bking: flink-zk: Initiate new flink::zookeeper role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [14:13:20] (03CR) 10Bking: flink-zk: Initiate new flink::zookeeper role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [14:13:25] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:941877|Revert "specials: Use cross-wiki aware UserIdentityLookup on Special:UserRights" (T255309 T342747)]] (duration: 12m 33s) [14:13:26] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:30] T255309: Remove UserRightsProxy and replace its usages with UserGroupManager - https://phabricator.wikimedia.org/T255309 [14:13:30] T342747: Special:UserRights in interwiki mode: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to the local wiki, but it belongs to 'metawiki' - https://phabricator.wikimedia.org/T342747 [14:13:33] okay, deployment done. [14:13:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:13:50] James_F: if you want to ship https://gerrit.wikimedia.org/r/c/mediawiki/core/+/941879/, feel free to :). [14:14:00] Ack. 
[14:14:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941879 (https://phabricator.wikimedia.org/T342733) (owner: 10Jforrester) [14:14:38] (03CR) 10Ssingh: "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:17:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:39] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: route requests to proton via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/941440 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [14:18:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:19:22] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=ats-be,name=cp2037.codfw.wmnet [14:19:55] !log disabling puppet on A:cp to deploy r/941440 [14:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:01] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route requests to proton via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/941440 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [14:23:20] PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:24:29] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4008.ulsfo.wmnet [14:26:14] PROBLEM - Uncommitted DNS changes in 
Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:26:43] (03CR) 10JMeybohm: "Are the namespace limits high enough to do this?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/941900 (https://phabricator.wikimedia.org/T342252) (owner: 10Clément Goubert) [14:26:46] (03PS1) 10Daniel Kinzler: Re-enable PC writes for parsoid endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941946 (https://phabricator.wikimedia.org/T339867) [14:27:20] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4008.ulsfo.wmnet [14:27:26] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:27:32] PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:27:53] (03CR) 10Clément Goubert: mw-api-int: Raise number of replicas to 10 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/941900 (https://phabricator.wikimedia.org/T342252) (owner: 10Clément Goubert) [14:28:25] !log end reboot of lvs4008 (T335835) [14:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:54] RECOVERY - PyBal backends health check on lvs4008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:29:00] RECOVERY - pybal on lvs4008 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:30:03] (03Merged) 10jenkins-bot: Normalize the skin name when it comes from preferences or useskin [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941879 (https://phabricator.wikimedia.org/T342733) (owner: 10Jforrester) [14:30:21] (03CR) 10David Martin: Create puppet scripting for 
sqooping Wikifunctions tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) (owner: 10David Martin) [14:30:33] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:941879|Normalize the skin name when it comes from preferences or useskin (T342733)]] [14:30:37] T342733: SkinException: No registered builder available for . - https://phabricator.wikimedia.org/T342733 [14:31:38] !log test [14:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:43] (03CR) 10JMeybohm: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:32:06] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:941879|Normalize the skin name when it comes from preferences or useskin (T342733)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:33:00] RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:33:09] (03PS1) 10Clément Goubert: P:parsoid::vd_client: Don't try to restart if stopped [puppet] - 10https://gerrit.wikimedia.org/r/941948 (https://phabricator.wikimedia.org/T342760) [14:33:29] !log enabling puppet on A:cp to deploy r/941440 [14:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:28] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=ats-be,name=cp2037.codfw.wmnet [14:34:35] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42706/console" [puppet] - 10https://gerrit.wikimedia.org/r/941948 
(https://phabricator.wikimedia.org/T342760) (owner: 10Clément Goubert) [14:35:21] (03CR) 10Jforrester: [C: 03+1] service::catalog: Switch wikifunctions to state production [puppet] - 10https://gerrit.wikimedia.org/r/941314 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [14:38:58] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:941879|Normalize the skin name when it comes from preferences or useskin (T342733)]] (duration: 08m 24s) [14:39:02] T342733: SkinException: No registered builder available for . - https://phabricator.wikimedia.org/T342733 [14:40:19] (03PS10) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [14:40:21] (03PS11) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [14:40:23] (03PS11) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:40:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] P:parsoid::vd_client: Don't try to restart if stopped [puppet] - 10https://gerrit.wikimedia.org/r/941948 (https://phabricator.wikimedia.org/T342760) (owner: 10Clément Goubert) [14:41:21] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:parsoid::vd_client: Don't try to restart if stopped [puppet] - 10https://gerrit.wikimedia.org/r/941948 (https://phabricator.wikimedia.org/T342760) (owner: 10Clément Goubert) [14:43:11] urbanecm: Filed T342775 as the merge-blocker. 
[14:43:12] T342775: Code merge blocker: MediaWiki\Auth\AuthManagerTest::testSecuritySensitiveOperationStatus with data set #0 (true) - https://phabricator.wikimedia.org/T342775 [14:43:17] ty [14:47:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Jclark-ctr) Server has not arrived yet they should arrive any day now. Jul 24, 2023 9:16 PM Departed Terminal Location ASHBURN, 20147, US [14:47:50] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:48:17] (03PS1) 10Hashar: admin: hashar: update gdbinit from php 7.4.30 [puppet] - 10https://gerrit.wikimedia.org/r/941949 [14:49:18] !log begin reboot of lvs4009 (T335835) [14:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:20] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:50:26] (03PS2) 10Hashar: admin: hashar: update gdbinit from php 7.4.30 [puppet] - 10https://gerrit.wikimedia.org/r/941949 [14:50:38] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:04] (03PS1) 10DCausse: Load RescoreFunctions from the ExtensionRegistry [extensions/CirrusSearch] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941882 (https://phabricator.wikimedia.org/T342744) [14:52:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] "We tended to wait a bit before merging these kinds of changes in order to avoid false alerts paging people. 
However, we now have the page " [puppet] - 10https://gerrit.wikimedia.org/r/941314 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [14:53:50] PROBLEM - PyBal backends health check on lvs4009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:54:02] PROBLEM - pybal on lvs4009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:54:50] PROBLEM - PyBal connections to etcd on lvs4009 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [14:55:09] jouncebot: nowandnext [14:55:09] No deployments scheduled for the next 1 hour(s) and 4 minute(s) [14:55:10] In 1 hour(s) and 4 minute(s): Wikifunctions.org creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1600) [14:56:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T342617)', diff saved to https://phabricator.wikimedia.org/P49721 and previous config saved to /var/cache/conftool/dbconfig/20230726-145651-ladsgroup.json [14:56:55] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:57:45] (03PS3) 10Ladsgroup: ores-extension: enable lw on eswikiquotes and eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939697 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [14:57:48] (03CR) 10Ladsgroup: [C: 03+2] ores-extension: enable lw on eswikiquotes and eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939697 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [14:58:27] (03Merged) 10jenkins-bot: ores-extension: enable lw on eswikiquotes and eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939697 (https://phabricator.wikimedia.org/T342115) (owner: 
10Ilias Sarantopoulos) [14:59:06] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:939697|ores-extension: enable lw on eswikiquotes and eswikibooks (T342115)]] [14:59:10] T342115: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 [15:00:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:00:38] !log ladsgroup@deploy1002 isaranto and ladsgroup: Backport for [[gerrit:939697|ores-extension: enable lw on eswikiquotes and eswikibooks (T342115)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [15:04:35] \o/ [15:10:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:10:36] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4009.ulsfo.wmnet [15:11:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P49722 and previous config saved to /var/cache/conftool/dbconfig/20230726-151157-ladsgroup.json [15:13:12] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:939697|ores-extension: enable lw on 
eswikiquotes and eswikibooks (T342115)]] (duration: 14m 06s) [15:13:16] T342115: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 [15:13:27] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4009.ulsfo.wmnet [15:13:30] PROBLEM - Host lvs4009 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:38] RECOVERY - Host lvs4009 is UP: PING OK - Packet loss = 0%, RTA = 70.96 ms [15:14:01] hmm [15:14:42] PROBLEM - PyBal backends health check on lvs4009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:14:46] ? [15:14:54] RECOVERY - pybal on lvs4009 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:15:03] ah, there was a reboot [15:15:27] jynus: all good, thanks [15:15:33] we are trying to figure out why this wasn't silenced [15:15:36] jynus: yes, we are rebooting [15:15:43] downtimed on Icinga correctly [15:16:10] RECOVERY - PyBal backends health check on lvs4009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:16:40] RECOVERY - PyBal connections to etcd on lvs4009 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [15:17:48] !log end reboot of lvs4009 (T335835) [15:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:34] !log begin reboot of lvs1020 (T335835) [15:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:27] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet [15:22:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941882 
(https://phabricator.wikimedia.org/T342744) (owner: 10DCausse) [15:22:49] James_F: thanks for deploying! <3 [15:23:07] dcausse: I have a vested interest, but sure. :-D [15:23:13] dcausse: Thanks for fixing! [15:23:18] :) [15:25:06] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1020.eqiad.wmnet [15:26:08] * James_F glares at CI. [15:26:30] !log end reboot of lvs1020 (T335835) [15:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P49723 and previous config saved to /var/cache/conftool/dbconfig/20230726-152703-ladsgroup.json [15:28:10] 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from esams to knams - https://phabricator.wikimedia.org/T342198 (10Papaul) Knams confirmed that it is not on issue bring the router in on a week. See email below ` Dear Papaul, If you a carrying the router with you, than no further action is needed... [15:29:11] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be100[34] - https://phabricator.wikimedia.org/T342675 (10MatthewVernon) I thought we were ordering one node per DC, so one in eqiad and one in codfw? Also: the drives should be JBOD non-RAID, not individual... [15:30:05] !log begin reboot of lvs1017 (T335835) [15:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:38] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10MatthewVernon) I thought we were ordering one node per DC, so one in eqiad and one in codfw? Also: the drives should be JBOD non-RAID, not individual... 
[15:32:16] !log derick@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [15:32:33] (03PS8) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [15:32:36] !log derick@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [15:32:59] (03CR) 10D3r1ck01: [C: 03+2] "deploy due to UBN" [deployment-charts] - 10https://gerrit.wikimedia.org/r/941952 (https://phabricator.wikimedia.org/T342783) (owner: 10D3r1ck01) [15:33:40] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:33:42] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MatthewVernon) [back-of-the-envelope suggests a few days, so don't expect immediate answers!] [15:33:50] (03Merged) 10jenkins-bot: Bump Proton to 2023-07-26-152000-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/941952 (https://phabricator.wikimedia.org/T342783) (owner: 10D3r1ck01) [15:34:02] !log derick@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [15:34:13] Good grief, it's still going. 
[15:34:17] (03PS1) 10Dreamy Jazz: clienthints: Start collecting client hints data on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941954 (https://phabricator.wikimedia.org/T341110) [15:34:30] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:34:39] (03PS1) 10Ilias Sarantopoulos: ores-extension: enable lw on itwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941955 (https://phabricator.wikimedia.org/T342115) [15:34:58] !log derick@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [15:35:12] !log derick@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [15:37:05] !log derick@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [15:37:09] (03Merged) 10jenkins-bot: Load RescoreFunctions from the ExtensionRegistry [extensions/CirrusSearch] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941882 (https://phabricator.wikimedia.org/T342744) (owner: 10DCausse) [15:37:17] !log derick@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [15:37:40] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:941882|Load RescoreFunctions from the ExtensionRegistry (T342744)]] [15:37:44] T342744: CirrusSearch\Profile\SearchProfileException: Cannot load a profile type rescore_function_chains: growth_underlinked_chain not found - https://phabricator.wikimedia.org/T342744 [15:38:36] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:38:37] !log derick@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [15:39:26] !log jforrester@deploy1002 dcausse and jforrester: Backport for [[gerrit:941882|Load RescoreFunctions from the ExtensionRegistry (T342744)]] synced to the testservers mwdebug1001.eqiad.wmnet, 
mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [15:42:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T342617)', diff saved to https://phabricator.wikimedia.org/P49724 and previous config saved to /var/cache/conftool/dbconfig/20230726-154209-ladsgroup.json [15:42:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [15:42:13] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:42:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [15:42:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [15:42:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10ayounsi) For `vrrp_peer` instead of doing some costly/complicated query from Netbox, I'm wondering if we could/should do it... 
[15:42:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [15:42:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T342617)', diff saved to https://phabricator.wikimedia.org/P49725 and previous config saved to /var/cache/conftool/dbconfig/20230726-154245-ladsgroup.json [15:43:59] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Add network devices fingerprints to known_hosts - https://phabricator.wikimedia.org/T327643 (10ayounsi) [15:44:08] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10ayounsi) [15:46:32] dcausse: OK, it's fpm-restarting now. Should be done. [15:46:45] \o/ [15:46:47] jnuche: Theoretically train is clear to roll to group1… [15:47:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [15:47:20] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:941882|Load RescoreFunctions from the ExtensionRegistry (T342744)]] (duration: 09m 39s) [15:47:24] T342744: CirrusSearch\Profile\SearchProfileException: Cannot load a profile type rescore_function_chains: growth_underlinked_chain not found - https://phabricator.wikimedia.org/T342744 [15:47:49] (03CR) 10Btullis: flink-zk: Initiate new flink::zookeeper role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [15:48:06] (03PS2) 10Dreamy Jazz: clienthints: Start collecting client hints data on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941954 (https://phabricator.wikimedia.org/T341110) [15:48:30] James_F: sure, it will probably overflow into the Wikifunctions.org window for a few minutes, but should be ok [15:48:32] all aboard the train [15:48:36] 
jnuche: Of course. [15:48:38] deploying to group1 again [15:49:04] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941957 (https://phabricator.wikimedia.org/T340247) [15:49:06] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941957 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [15:49:48] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941957 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [15:52:24] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1017.eqiad.wmnet [15:55:33] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1017.eqiad.wmnet [15:55:36] PROBLEM - Host lvs1017 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:36] RECOVERY - Host lvs1017 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [15:55:46] :] [15:55:55] :| [15:56:08] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:56:09] fabfur: will file a task :) [15:56:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:56:54] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.19 refs T340247 [15:56:58] RECOVERY - pybal on lvs1017 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:56:58] T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247 [15:57:40] RECOVERY - PyBal backends health 
check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:00:05] James_F: gettimeofday() says it's time for Wikifunctions.org creation. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1600) [16:00:09] Whee. [16:00:34] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [16:01:09] jnuche: Say when. :-) [16:01:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:01:39] nearly there, restarting php-fpm... [16:03:04] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:03:51] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.19 refs T340247 (duration: 06m 56s) [16:03:55] T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247 [16:03:58] James_F: wikifunction away! [16:04:03] Yay. 
[16:04:38] (03PS12) 10Jforrester: Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) [16:04:40] (03PS16) 10Jforrester: Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [16:04:42] (03PS10) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 [16:04:44] (03PS8) 10Jforrester: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632 [16:04:46] (03PS14) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [16:04:48] (03PS3) 10Jforrester: [DNM] Move wikifunctions.org from locked-down to limited deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941515 [16:04:49] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) [16:05:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [16:06:10] (03Merged) 10jenkins-bot: Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [16:06:36] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:771623|Add wikifunctions.org to prod wgLocalVirtualHosts (T275945)]] [16:06:40] T275945: Create Wikifunctions.org - https://phabricator.wikimedia.org/T275945 [16:07:05] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team: Decide sudoers rules for users without global root - 
https://phabricator.wikimedia.org/T325067 (10fnegri) [16:07:09] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) [16:07:16] !log end reboot of lvs1017 (T335835) [16:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:04] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:08:13] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:771623|Add wikifunctions.org to prod wgLocalVirtualHosts (T275945)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [16:09:16] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10fnegri) [16:14:02] (03CR) 10Jforrester: [C: 03+2] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [16:14:41] (03Merged) 10jenkins-bot: Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [16:15:43] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:771623|Add wikifunctions.org to prod wgLocalVirtualHosts (T275945)]] (duration: 09m 07s) [16:15:47] T275945: Create Wikifunctions.org - https://phabricator.wikimedia.org/T275945 [16:16:15] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Upgrade new codfw switches to Juniper recommended - https://phabricator.wikimedia.org/T341670 
(10cmooney) [16:16:21] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [16:18:21] !log begin reboot of lvs1018 (T335835) [16:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:58] !log jforrester@deploy1002 Started scap: Initial deploy of wikifunctionswiki in locked-down mode for T275945 [16:22:30] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:22:48] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:23:13] (03PS1) 10EoghanGaffney: releases: Change owner on /srv/patches after rsync [puppet] - 10https://gerrit.wikimedia.org/r/941961 (https://phabricator.wikimedia.org/T342016) [16:23:45] (03CR) 10Ssingh: "Thanks for this change!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/941758 (owner: 10Volans) [16:25:02] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [16:26:04] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (DIFF 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42707/console" [puppet] - 10https://gerrit.wikimedia.org/r/941961 (https://phabricator.wikimedia.org/T342016) (owner: 10EoghanGaffney) [16:27:03] (03PS2) 10EoghanGaffney: releases: Change owner on /srv/patches after rsync [puppet] - 10https://gerrit.wikimedia.org/r/941961 (https://phabricator.wikimedia.org/T342016) [16:27:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T342617)', diff saved to https://phabricator.wikimedia.org/P49726 and previous config saved to /var/cache/conftool/dbconfig/20230726-162705-ladsgroup.json [16:27:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:27:48] !log jforrester@deploy1002 Finished scap: Initial deploy of wikifunctionswiki in locked-down mode for T275945 (duration: 07m 49s) [16:27:51] T275945: Create Wikifunctions.org - https://phabricator.wikimedia.org/T275945 [16:27:54] (03PS2) 10FNegri: tcpircbot: add another port for cloud IRC logging [puppet] - 10https://gerrit.wikimedia.org/r/941441 (https://phabricator.wikimedia.org/T342666) [16:39:36] James_F: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/mediawiki.yaml#233 defines a special docroot for wikifunctions.org, but I don't see that in any of your patches. intentional? [16:39:57] taavi: Ooooh. [16:40:12] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet [16:40:12] * James_F will make a quick patch.
[16:40:43] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Krinkle) >>! In T211661#9044465, @MatthewVernon wrote: > […] Out of interest, I went looking at how many of these we served on 24... [16:40:55] * taavi assumes that means 'no' [16:42:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P49727 and previous config saved to /var/cache/conftool/dbconfig/20230726-164211-ladsgroup.json [16:42:30] Indeed. [16:43:25] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet [16:43:36] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:44:22] !log end reboot of lvs1018 (T335835) [16:44:23] (03PS1) 10Jforrester: docroot: Add wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941965 (https://phabricator.wikimedia.org/T275945) [16:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941965 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [16:44:48] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:45:01] taavi: It doesn't help that WikimediaDebug hard-codes what domains it knows about. Ah well. 
[16:45:06] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:45:21] (03Merged) 10jenkins-bot: docroot: Add wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941965 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [16:45:48] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:941965|docroot: Add wikifunctions.org (T275945)]] [16:45:53] T275945: Create Wikifunctions.org - https://phabricator.wikimedia.org/T275945 [16:46:11] (03PS1) 10Ssingh: Release dnsdist 1.8.0-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/941966 (https://phabricator.wikimedia.org/T342154) [16:47:00] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 34 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [16:47:41] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:941965|docroot: Add wikifunctions.org (T275945)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [16:50:43] (03CR) 10FNegri: tcpircbot: add another port for cloud IRC logging (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/941441 (https://phabricator.wikimedia.org/T342666) (owner: 10FNegri) [16:50:54] taavi: I think that should fix www.wikifunctions.org not routing, but not wikifunctions.org; that's probably a DNS twiddle we forgot. [16:53:07] Yup, now upgraded to "No wiki found". 
[16:53:54] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:941965|docroot: Add wikifunctions.org (T275945)]] (duration: 08m 05s) [16:53:59] T275945: Create Wikifunctions.org - https://phabricator.wikimedia.org/T275945 [16:55:59] (03PS5) 10Kamila Součková: add WIP Benthos smoke test to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) [16:56:28] (03PS1) 10Krinkle: api: Fix broken /api/index.html rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941969 (https://phabricator.wikimedia.org/T113114) [16:57:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P49728 and previous config saved to /var/cache/conftool/dbconfig/20230726-165717-ladsgroup.json [17:02:38] (03PS1) 10Jforrester: MWMultiVersion: Alert this code to wikifunctions.org existing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941970 (https://phabricator.wikimedia.org/T275945) [17:03:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941970 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [17:04:01] (03Merged) 10jenkins-bot: MWMultiVersion: Alert this code to wikifunctions.org existing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941970 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [17:04:09] James_F: you probably want an entry here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/mediawiki/files/apache/sites/redirects/redirects.dat#183 for the plain wikifunctions.org redirect [17:04:27] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:941970|MWMultiVersion: Alert this code to wikifunctions.org existing (T275945)]] [17:04:32] T275945: Create Wikifunctions.org - https://phabricator.wikimedia.org/T275945 [17:04:32] taavi: 
Aha, right. Thanks. [17:06:05] (03CR) 10Kamila Součková: "Note that upon deploy, this will read all the kafka messages in the selected topic. I am hoping that reading is cheap, but I will get some" [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [17:06:08] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:941970|MWMultiVersion: Alert this code to wikifunctions.org existing (T275945)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [17:07:45] did you have a workaround for the x-w-d extension not working? :P [17:08:07] (03PS1) 10Jforrester: apache: Redirect wikifunctions.org to www.wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/941971 (https://phabricator.wikimedia.org/T275945) [17:08:09] (03PS1) 10Jforrester: Add wikifunctions.org to certspotter::monitor_domains [puppet] - 10https://gerrit.wikimedia.org/r/941972 (https://phabricator.wikimedia.org/T275945) [17:08:34] taavi: curl with Host: header. [17:08:50] But it's not like it could be more broken. 
:-) [17:11:35] `Fatal exception of type "Wikimedia\Rdbms\DBQueryError"` indeed, not sure what I expected [17:12:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T342617)', diff saved to https://phabricator.wikimedia.org/P49729 and previous config saved to /var/cache/conftool/dbconfig/20230726-171223-ladsgroup.json [17:12:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [17:12:28] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [17:12:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [17:12:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T342617)', diff saved to https://phabricator.wikimedia.org/P49730 and previous config saved to /var/cache/conftool/dbconfig/20230726-171244-ladsgroup.json [17:13:08] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:941970|MWMultiVersion: Alert this code to wikifunctions.org existing (T275945)]] (duration: 08m 40s) [17:13:12] T275945: Create Wikifunctions.org - https://phabricator.wikimedia.org/T275945 [17:13:42] (03PS3) 10FNegri: tcpircbot: add another port for cloud IRC logging [puppet] - 10https://gerrit.wikimedia.org/r/941441 (https://phabricator.wikimedia.org/T342666) [17:15:14] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10thcipriani) [17:19:29] (03CR) 10FNegri: "test experimental" [puppet] - 10https://gerrit.wikimedia.org/r/941441 (https://phabricator.wikimedia.org/T342666) (owner: 10FNegri) [17:19:59] (03PS9) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [17:23:07] (03CR) 10Bking: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [17:23:42] (03PS1) 10Jforrester: wgNoFollowDomainExceptions: Add wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941974 (https://phabricator.wikimedia.org/T275945) [17:23:44] (03PS1) 10Jforrester: Wikifunctions: Allow wikifunctions-staff to give and take all the rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941975 (https://phabricator.wikimedia.org/T275945) [17:24:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941974 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [17:25:08] (03Merged) 10jenkins-bot: wgNoFollowDomainExceptions: Add wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941974 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [17:25:24] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/941441 (https://phabricator.wikimedia.org/T342666) (owner: 10FNegri) [17:25:38] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:941974|wgNoFollowDomainExceptions: Add wikifunctions.org (T275945)]] [17:25:42] T275945: Create Wikifunctions.org - https://phabricator.wikimedia.org/T275945 [17:26:12] (03PS3) 10Sohom Datta: Add validator userright for pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) [17:27:22] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:941974|wgNoFollowDomainExceptions: Add wikifunctions.org (T275945)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [17:27:33] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2023/2024-Q1): Update Spicerack 
documentation - https://phabricator.wikimedia.org/T325754 (10fnegri) [17:28:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:51] (03CR) 10Sohom Datta: Add validator userright for pawikisource (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) (owner: 10Sohom Datta) [17:29:09] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q1): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) [17:30:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1730) [17:30:08] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q1): [spicerack] support including {project} in SAL messages - https://phabricator.wikimedia.org/T341793 (10fnegri) [17:33:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:37] (03CR) 10Jforrester: [C: 03+2] Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 (owner: 10Jforrester) [17:33:40] (03CR) 10Jforrester: [C: 03+2] Wikifunctions: Allow wikifunctions-staff to give and take all the rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941975 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[17:33:57] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632 (owner: 10Jforrester) [17:34:43] (03Merged) 10jenkins-bot: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 (owner: 10Jforrester) [17:34:45] (03Merged) 10jenkins-bot: Wikifunctions: Allow wikifunctions-staff to give and take all the rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941975 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [17:34:48] (03Merged) 10jenkins-bot: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632 (owner: 10Jforrester) [17:35:14] (03PS10) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [17:37:05] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:941974|wgNoFollowDomainExceptions: Add wikifunctions.org (T275945)]] (duration: 11m 27s) [17:37:09] T275945: Create Wikifunctions.org - https://phabricator.wikimedia.org/T275945 [17:39:44] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10ssingh) Hi @Isaac: The link you shared is correct and no SSH keys are required. We will wait for @NHillard-WMF to confirm if he requires more access, otherwise this tic... 
[17:39:57] (03CR) 10Bking: "Check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [17:41:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10ssingh) [17:41:38] !log jforrester@deploy1002 Started scap: Hopefully final update for wikifunctions.org initial config [17:44:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10ssingh) Hi @NHillard-WMF: This will require approval from your manager. Also adding @odimitrijevic and @Milimetric for approval on the Analytics side. [17:46:51] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q1): tcpircbot: enable logging to #wikimedia-cloud-feed - https://phabricator.wikimedia.org/T342666 (10fnegri) [17:49:08] !log jforrester@deploy1002 Finished scap: Hopefully final update for wikifunctions.org initial config (duration: 07m 30s) [17:50:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Maryana Pinchuk - https://phabricator.wikimedia.org/T342797 (10Isaac) [17:52:20] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Maryana Pinchuk - https://phabricator.wikimedia.org/T342797 (10Isaac) Same idea as T342588. FYI @Maryana I think this will require manager approval per T342588#9045685 [17:53:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Isaac) Thanks @ssingh ! For Maryana's access, I filed T342797 [17:55:01] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Maryana Pinchuk - https://phabricator.wikimedia.org/T342797 (10ssingh) [17:57:40] (03PS1) 10Cory Massaro: Pass correct format of the Host header into the orhcestrator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/941978 (https://phabricator.wikimedia.org/T342795) [17:58:37] (03CR) 10Cory Massaro: "No idea if this is right, but ..." [deployment-charts] - 10https://gerrit.wikimedia.org/r/941978 (https://phabricator.wikimedia.org/T342795) (owner: 10Cory Massaro) [17:58:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T342617)', diff saved to https://phabricator.wikimedia.org/P49731 and previous config saved to /var/cache/conftool/dbconfig/20230726-175850-ladsgroup.json [17:58:56] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:00:06] jnuche and dancy: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1800). [18:00:06] jnuche and dancy: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1800). [18:03:42] (03PS2) 10Cory Massaro: Pass correct format of the Host header into the orhcestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/941978 (https://phabricator.wikimedia.org/T342795) [18:04:07] (03CR) 10Jforrester: [C: 03+2] Pass correct format of the Host header into the orhcestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/941978 (https://phabricator.wikimedia.org/T342795) (owner: 10Cory Massaro) [18:04:58] (03Merged) 10jenkins-bot: Pass correct format of the Host header into the orhcestrator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/941978 (https://phabricator.wikimedia.org/T342795) (owner: 10Cory Massaro) [18:06:54] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10ssingh) [18:07:41] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM, thanks for the fix" [puppet] - 10https://gerrit.wikimedia.org/r/941961 (https://phabricator.wikimedia.org/T342016) (owner: 10EoghanGaffney) [18:08:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10ssingh) @thcipriani and @odimitrijevic/@Milimetric this requires your approval for the deployment and analytics-privatedata groups respectively. [18:08:11] (03PS1) 10Jforrester: apache: Actually enable view_urls on wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/941979 (https://phabricator.wikimedia.org/T342794) [18:08:34] (03CR) 10CI reject: [V: 04-1] apache: Actually enable view_urls on wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/941979 (https://phabricator.wikimedia.org/T342794) (owner: 10Jforrester) [18:09:55] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:10:33] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:10:45] (03PS2) 10Jforrester: apache: Actually enable view_urls on wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/941979 (https://phabricator.wikimedia.org/T342794) [18:10:53] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:12:03] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:12:10] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:13:52] !log jforrester@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/wikifunctions: apply [18:13:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P49732 and previous config saved to /var/cache/conftool/dbconfig/20230726-181356-ladsgroup.json [18:15:03] TheresNoTime: Umm. How did you make that edit? [18:15:18] via the global +sysadmin group, I assume [18:15:33] Ah, right, yeah, that'd bypass all the lock-down restrictions. [18:15:35] oh crap [18:15:38] Please don't. :-) [18:15:38] apologies [18:15:45] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/941990 (owner: 10Paladox) [18:15:50] No worries, I was just worried all my lock-down stuff didn't apply somehow. [18:15:59] But +sysadmin bypasses (as indeed it does for me). [18:17:36] James_F: any reason why wikifunctionswiki should not be added to https://meta.wikimedia.org/wiki/Special:WikiSets/12? [18:17:46] (and https://meta.wikimedia.org/wiki/Special:WikiSets/7 maybe? not sure if you talked about that somewhere?) [18:18:01] taavi: Please go ahead for 12. [18:18:08] For 7 maaaaybe? [18:18:08] Is it in SUL yet? [18:18:17] yes [18:18:17] Or it's a special wiki? :O [18:18:21] It's an SUL pre-wiki with full lock-down so only staff can edit. [18:18:26] And it's also a special wiki. [18:18:32] noice [18:18:33] "Special" doesn't mean "not SUL". [18:18:39] yeah I know :P [18:18:56] Once it actually /works/, we're going to start letting people edit. [18:19:45] so you can reserve your low user ID today, but editing comes later [18:20:01] Ha, we don't show user IDs much any more. [18:20:08] Certainly not local user IDs.
[18:20:43] * TheresNoTime got `42` :D [18:21:19] (03Abandoned) 10Paladox: wikistats: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/941990 (owner: 10Paladox) [18:21:21] (03Abandoned) 10RhinosF1: wikistats: support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/941989 (owner: 10RhinosF1) [18:21:25] (03Restored) 10RhinosF1: wikistats: support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/941989 (owner: 10RhinosF1) [18:21:54] (03CR) 10RhinosF1: "wrong commit oops" [puppet] - 10https://gerrit.wikimedia.org/r/941989 (owner: 10RhinosF1) [18:24:04] (03CR) 10Paladox: [C: 03+1] wikistats: support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/941989 (owner: 10RhinosF1) [18:25:10] (03PS1) 10Jforrester: Wikifunctions: Push ULS language selector to interlanguage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941981 (https://phabricator.wikimedia.org/T275945) [18:26:56] (03PS1) 10Jforrester: Wikifunctions: Actually allow wikifunctions-staff to make wikitext edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941982 (https://phabricator.wikimedia.org/T275945) [18:27:02] (03CR) 10Urbanecm: [C: 04-1] Add validator userright for pawikisource (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) (owner: 10Sohom Datta) [18:27:10] jouncebot: nowandnext [18:27:11] For the next 0 hour(s) and 32 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1800) [18:27:11] For the next 1 hour(s) and 32 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T1800) [18:27:11] In 1 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T2000) [18:27:16] Meh. 
[18:27:40] (03CR) 10Jforrester: [C: 03+2] Wikifunctions: Push ULS language selector to interlanguage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941981 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [18:27:44] (03CR) 10Jforrester: [C: 03+2] Wikifunctions: Actually allow wikifunctions-staff to make wikitext edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941982 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [18:28:23] (03Merged) 10jenkins-bot: Wikifunctions: Push ULS language selector to interlanguage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941981 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [18:28:26] (03Merged) 10jenkins-bot: Wikifunctions: Actually allow wikifunctions-staff to make wikitext edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941982 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [18:28:51] (03PS1) 10Cory Massaro: Remove other offending commas from the orchestrator's configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/941983 (https://phabricator.wikimedia.org/T342795) [18:29:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P49734 and previous config saved to /var/cache/conftool/dbconfig/20230726-182902-ladsgroup.json [18:29:50] (03CR) 10Jforrester: [C: 03+2] Remove other offending commas from the orchestrator's configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/941983 (https://phabricator.wikimedia.org/T342795) (owner: 10Cory Massaro) [18:30:38] (03Merged) 10jenkins-bot: Remove other offending commas from the orchestrator's configuration. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/941983 (https://phabricator.wikimedia.org/T342795) (owner: 10Cory Massaro) [18:33:01] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:33:04] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:34:13] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:34:45] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:34:54] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:36:05] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:36:24] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:37:21] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:37:55] !log jforrester@deploy1002 Synchronized wmf-config/: Last fixes for initial wikifunctions.org, he says (duration: 06m 44s) [18:39:55] "id": 70, [18:39:56] :P [18:40:24] (03PS1) 10Jforrester: wikifunctions: Use www.wikifunctions.org for Host: [deployment-charts] - 10https://gerrit.wikimedia.org/r/942007 (https://phabricator.wikimedia.org/T342795) [18:40:34] (03PS1) 10Cory Massaro: Qualify www.wikifunctions.org because redirect is not yet set up. [deployment-charts] - 10https://gerrit.wikimedia.org/r/942008 [18:40:46] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Use www.wikifunctions.org for Host: [deployment-charts] - 10https://gerrit.wikimedia.org/r/942007 (https://phabricator.wikimedia.org/T342795) (owner: 10Jforrester) [18:40:50] (03Abandoned) 10Cory Massaro: Qualify www.wikifunctions.org because redirect is not yet set up. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/942008 (owner: 10Cory Massaro) [18:41:32] (03Merged) 10jenkins-bot: wikifunctions: Use www.wikifunctions.org for Host: [deployment-charts] - 10https://gerrit.wikimedia.org/r/942007 (https://phabricator.wikimedia.org/T342795) (owner: 10Jforrester) [18:43:24] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:44:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T342617)', diff saved to https://phabricator.wikimedia.org/P49735 and previous config saved to /var/cache/conftool/dbconfig/20230726-184408-ladsgroup.json [18:44:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [18:44:13] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:44:13] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:44:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [18:44:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T342617)', diff saved to https://phabricator.wikimedia.org/P49736 and previous config saved to /var/cache/conftool/dbconfig/20230726-184430-ladsgroup.json [18:44:38] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:45:32] (03CR) 10AOkoth: [C: 03+2] wikistats: support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/941989 (owner: 10RhinosF1) [18:45:40] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:46:55] (03PS2) 10David Martin: Create puppet scripting for sqooping Wikifunctions tables [puppet] - 10https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) [18:47:44] !log jforrester@deploy1002 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [18:48:11] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:48:47] OK, I'm declaring a lid on my prod fun. [18:49:04] wikifunctions.org is mostly sort-of up, in locked-down mode. More soon! [18:50:18] woo! [18:50:55] (03CR) 10JHathaway: DO NOT MERGE: Remove hostname from ssh known_hosts aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941543 (owner: 10JHathaway) [18:54:06] (03PS4) 10Sohom Datta: Add validator userright for pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) [18:54:56] (03CR) 10Sohom Datta: Add validator userright for pawikisource (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) (owner: 10Sohom Datta) [18:55:02] (03PS5) 10Sohom Datta: Add validator userright for pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) [19:02:31] (03PS2) 10Cory Massaro: apache: Redirect wikifunctions.org to www.wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/941971 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [19:03:06] (03CR) 10AOkoth: [C: 03+2] wmflib::php_version: allow 8.2 [puppet] - 10https://gerrit.wikimedia.org/r/941991 (owner: 10RhinosF1) [19:04:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:14:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:16:30] (03CR) 10AOkoth: [C: 03+2] wikistats::httpd: also support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/941993 (owner: 10RhinosF1) [19:26:58] (03CR) 10Urbanecm: [C: 03+1] "LGTM! Please feel free to schedule for deployment via https://wikitech.wikimedia.org/wiki/Deployments starting next week, when the Proofre" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) (owner: 10Sohom Datta) [19:30:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T342617)', diff saved to https://phabricator.wikimedia.org/P49737 and previous config saved to /var/cache/conftool/dbconfig/20230726-193014-ladsgroup.json [19:30:19] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [19:45:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P49738 and previous config saved to /var/cache/conftool/dbconfig/20230726-194520-ladsgroup.json [19:48:15] (03PS5) 10JHathaway: site.pp: Drop wmnet domain and always use regexes [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) [19:52:21] (03PS2) 10Dreamy Jazz: CheckUser event table migration: Write new on group0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941021 (https://phabricator.wikimedia.org/T330158) [19:52:36] (03PS3) 10Dreamy Jazz: clienthints: Start collecting client hints data on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941954 (https://phabricator.wikimedia.org/T341110) [19:55:08] (03PS1) 
10Cory Massaro: Add timeout values in milliseconds as environment variables. [deployment-charts] - 10https://gerrit.wikimedia.org/r/942017 [19:56:05] (03PS6) 10JHathaway: site.pp: Drop top level domain names: .wment .org [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) [19:56:36] (03PS7) 10JHathaway: site.pp: Drop top level domain names: .wmnet .org [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) [19:58:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230726T2000). [20:00:04] Dreamy_Jazz and Dreamy_Jazz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] I can deploy [20:00:21] \o [20:00:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P49739 and previous config saved to /var/cache/conftool/dbconfig/20230726-200026-ladsgroup.json [20:02:00] Dreamy_Jazz: for the second patch, is there a reason you're going directly to all of group1 instead of doing a smaller increment? [20:02:08] No particular reason. [20:02:15] Amir1: you mentioned you were testing something on mwdebug, are you clear of production so I can deploy? 
[20:02:17] I'm happy to go to just group0 if you would prefer [20:02:32] I actually forgot [20:02:52] feel free to move forward, let me know once you're done. I'm writing a document [20:02:57] * Amir1 cries in paperwork [20:02:58] Amir1: ack, thanks [20:03:03] Let me know if you would prefer, as I can quickly change the patch [20:03:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:03:37] Dreamy_Jazz: yes please, just in case the schema change was missed somewhere it's nice to have a smaller blast radius [20:03:46] Sure. Will update that patch now [20:05:15] (03PS3) 10Dreamy Jazz: CheckUser event table migration: Write new on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941021 (https://phabricator.wikimedia.org/T330158) [20:05:48] Patch updated. [20:05:56] thanks! 
[20:06:09] (03PS4) 10Majavah: CheckUser event table migration: Write new on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941021 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [20:06:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941954 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [20:06:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941021 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [20:07:30] (03Merged) 10jenkins-bot: clienthints: Start collecting client hints data on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941954 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [20:07:33] (03Merged) 10jenkins-bot: CheckUser event table migration: Write new on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941021 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [20:08:02] !log taavi@deploy1002 Started scap: Backport for [[gerrit:941954|clienthints: Start collecting client hints data on testwiki (T341110)]], [[gerrit:941021|CheckUser event table migration: Write new on group0 (T330158)]] [20:08:08] T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158 [20:08:08] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [20:09:41] !log taavi@deploy1002 dreamyjazz and taavi: Backport for [[gerrit:941954|clienthints: Start collecting client hints data on testwiki (T341110)]], [[gerrit:941021|CheckUser event table migration: Write new on group0 (T330158)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD [20:09:42] option) 
[20:10:05] Are both changes ready to test? [20:10:17] yes. do you need me to push some buttons or run any queries for you to test? [20:10:29] Will need to run some queries. [20:10:34] *need you [20:10:43] I can carry out other testing steps [20:13:48] Testing of the first change (client hints) complete for my part [20:14:56] Please inspect "cu_useragent_clienthints_map" table for testwiki and check if rows exist with "uachm_reference_id" as "576846" and "576847" [20:15:09] Also please inspect "cu_useragent_clienthints" table and see if rows exist [20:15:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T342617)', diff saved to https://phabricator.wikimedia.org/P49740 and previous config saved to /var/cache/conftool/dbconfig/20230726-201533-ladsgroup.json [20:15:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [20:15:38] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [20:15:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [20:15:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T342617)', diff saved to https://phabricator.wikimedia.org/P49741 and previous config saved to /var/cache/conftool/dbconfig/20230726-201554-ladsgroup.json [20:16:30] Dreamy_Jazz: the cu_useragent_clienthints_map table has 22 rows for that condition, and cu_useragent_clienthints has 11 rows [20:16:46] That is as expected [20:16:55] Proceeding to test the next one [20:17:01] Going to use officewiki to test [20:17:35] Actually scrap that, I will use a different testwiki on group0 [20:17:53] I will use testwikidatawiki [20:20:40] Okay my steps are complete [20:21:44] Please inspect that on "testwikidatawiki" in "cu_log_event" there exists one row with "cule_log_id" as "247074". 
[20:21:55] Then check that rows exist in the table "cu_private_event" [20:22:36] There should be a row with the "cupe_action" as "checkuser-login-success" [20:23:21] there is a row in cu_private_event (cupe_log_type: 'checkuser-private-event', cupe_log_action: 'login-success'), but cu_log_event is empty [20:23:33] Hmm. [20:23:39] Let me try again with another log action [20:23:58] I think moves may not appear in the recentchanges table [20:24:02] *page creations [20:24:22] Can't seem to use Special:Move on that wiki... [20:24:50] are you missing autoconfirmed? I can fix that [20:25:01] I am probably [20:25:16] Yup. I have no rights. [20:25:22] If I could be given it so I can perform a move. [20:25:41] try now? [20:26:38] Okay. Please check "cu_log_event" again. [20:26:41] Once that's done, there should be rows in "cu_changes" with "cuc_only_for_read_old" column set to "1". [20:26:54] If both of those are fine, then the test is successful. [20:27:22] yep, I can see a row in cu_log_event for the page move, and in cu_changes for both the move and your login [20:27:28] Great! [20:27:32] Thanks for checking [20:27:49] logstash looks ok too, so syncing [20:28:26] The page creation didn't show up in Special:RecentChanges, so it not appearing in "cu_log_event" is expected. 
[20:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:34:20] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:941954|clienthints: Start collecting client hints data on testwiki (T341110)]], [[gerrit:941021|CheckUser event table migration: Write new on group0 (T330158)]] (duration: 26m 17s) [20:34:26] T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158 [20:34:26] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [20:34:31] deployed [20:34:35] Thanks! [20:34:43] Amir1: I'm done, floor is yours [20:37:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T342617)', diff saved to https://phabricator.wikimedia.org/P49742 and previous config saved to /var/cache/conftool/dbconfig/20230726-203751-ladsgroup.json [20:37:56] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [20:38:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:44:07] (03CR) 10David Martin: Create puppet scripting for sqooping Wikifunctions tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) (owner: 10David Martin) [20:46:10] (03CR) 10Krinkle: "@Ottomata: This has been locally committed on Beta and rebased for several years. 
I don't know what this does or why, but I'm hoping you k" [puppet] - 10https://gerrit.wikimedia.org/r/941475 (owner: 10Krinkle) [20:48:36] (03CR) 10Krinkle: "@Content-Transform-Team: This has been locally committed on Beta and rebased for over two years, since 2020." [puppet] - 10https://gerrit.wikimedia.org/r/941477 (owner: 10Krinkle) [20:50:42] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:54] (03PS6) 10Kamila Součková: add WIP Benthos smoke test to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) [20:52:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P49743 and previous config saved to /var/cache/conftool/dbconfig/20230726-205257-ladsgroup.json [20:57:16] (03PS11) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [21:00:13] !log manually attach User:WikiLambda_system to SUL T342811 [21:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:23] T342811: User:WikiLambda system can be registered due to not existing on SUL - https://phabricator.wikimedia.org/T342811 [21:03:14] taavi: Ooh, thanks.
[21:04:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb1013.eqiad.wmnet with OS bullseye [21:04:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye [21:04:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1013.eqiad.wmnet with OS bullseye [21:04:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Rem... [21:08:04] (03CR) 10Subramanya Sastry: "I don't know actually. But, I wonder if some VE testing relies on accessing Parsoid in beta? Added Ed and Scott in case they know." [puppet] - 10https://gerrit.wikimedia.org/r/941477 (owner: 10Krinkle) [21:08:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P49744 and previous config saved to /var/cache/conftool/dbconfig/20230726-210804-ladsgroup.json [21:12:32] taavi: thanks! 
[21:19:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [21:20:09] (03CR) 10Andrea Denisse: [C: 03+2] xhgui: Remove xhgui1001 and xhgui1002 node definitions [puppet] - 10https://gerrit.wikimedia.org/r/941550 (https://phabricator.wikimedia.org/T342724) (owner: 10Andrea Denisse) [21:23:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T342617)', diff saved to https://phabricator.wikimedia.org/P49745 and previous config saved to /var/cache/conftool/dbconfig/20230726-212310-ladsgroup.json [21:23:15] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [21:25:12] (03CR) 10Arlolra: [BETA HACK] Allow external access from anywhere to parsoid port 80 for CI purposes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941477 (owner: 10Krinkle) [21:29:44] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CRITICAL: Puppet has been disabled for 604905 seconds, message: testing prometheus https - brett, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:42:01] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (10bking) Posting info from today's pairing session w @RKemper: Known-good host: ` bking@wcqs2002:~$ curl -kIL https://localhost/readiness-probe HTTP/1.1 200 OK server: nginx/1.14.2 date... 
[21:46:28] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wcqs2001.codfw.wmnet
[21:50:50] (PS1) Bking: Fix typo in BUILD_VERSION [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/942025
[21:52:04] (CR) Ryan Kemper: [C: +1] Fix typo in BUILD_VERSION [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/942025 (owner: Bking)
[21:52:12] (CR) Bking: [C: +2] Fix typo in BUILD_VERSION [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/942025 (owner: Bking)
[21:53:32] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wcqs2001.codfw.wmnet
[21:57:38] SRE, AS-Report, Performance-Team, Traffic: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (Krinkle)
[21:58:28] SRE, AS-Report, Performance-Team, Traffic: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (Krinkle)
[22:07:55] sre-alert-triage, Data-Platform-SRE: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (bking) Troubleshooting steps taken so far on wcqs2001: - Restarted envoy, nginx, and wcqs-blazegraph . - Rebooted the host - Verified that Blazegraph WebUI is up and responding on...
[22:12:35] (CR) Bking: flink-zk: Initiate new flink::zookeeper role (2 comments) [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[22:25:45] (CR) Jforrester: "Let's also bump the version of the docker image for the evaluator at the same time?" [deployment-charts] - https://gerrit.wikimedia.org/r/942017 (owner: Cory Massaro)
[22:30:39] jouncebot: nowandnext
[22:30:40] No deployments scheduled for the next 7 hour(s) and 29 minute(s)
[22:30:40] In 7 hour(s) and 29 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T0600)
[22:30:40] In 7 hour(s) and 29 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T0600)
[22:53:59] (PS1) Jforrester: Update interwiki cache now that wikifunctions is here [mediawiki-config] - https://gerrit.wikimedia.org/r/942028
[22:54:16] (CR) Jforrester: [C: +2] Update interwiki cache now that wikifunctions is here [mediawiki-config] - https://gerrit.wikimedia.org/r/942028 (owner: Jforrester)
[22:54:54] (Merged) jenkins-bot: Update interwiki cache now that wikifunctions is here [mediawiki-config] - https://gerrit.wikimedia.org/r/942028 (owner: Jforrester)
[23:01:26] !log jforrester@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache now that wikifunctions is here (duration: 06m 52s)
[23:15:19] (PS1) Jforrester: Wikifunctions: Add logo, wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/942030
[23:26:46] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (odimitrijevic) Approved
[23:51:30] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: adds-changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state