[00:01:15] (CR) Fabfur: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1012790 (https://phabricator.wikimedia.org/T360454) (owner: Fabfur) [00:07:13] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:12:40] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:13:10] (CR) Fabfur: [V:+1 C:+2] benthos: switch to unix socket for performance testing [puppet] - https://gerrit.wikimedia.org/r/1012790 (https://phabricator.wikimedia.org/T360454) (owner: Fabfur) [00:13:14] (CR) Ssingh: [C:+2] P:cumin: add alias for dnsbox hosts (dns-rec/auth) [puppet] - https://gerrit.wikimedia.org/r/1012688 (owner: Ssingh) [00:18:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:26:09] (PS3) Ssingh: cookbooks.sre.dns: add roll-reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054)
[00:30:04] (PS1) BryanDavis: tox: Bump Python interpreter to 3.9 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012796 [00:30:07] (PS1) BryanDavis: Add redis image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012797 (https://phabricator.wikimedia.org/T360378) [00:31:56] (CR) RLazarus: [C:+2] mediawiki: Add mwscript labels to the job as well as the pods [deployment-charts] - https://gerrit.wikimedia.org/r/1009373 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus) [00:33:21] (Merged) jenkins-bot: mediawiki: Add mwscript labels to the job as well as the pods [deployment-charts] - https://gerrit.wikimedia.org/r/1009373 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus) [00:33:41] (CR) BryanDavis: [C:+2] tox: Bump Python interpreter to 3.9 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012796 (owner: BryanDavis) [00:34:15] (Merged) jenkins-bot: tox: Bump Python interpreter to 3.9 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012796 (owner: BryanDavis) [00:37:44] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1012652 [00:37:46] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1012652 (owner: TrainBranchBot) [00:43:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:45:40] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:45:46] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:48:40] (KubernetesRsyslogDown) firing: 
rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:53:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:04:34] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1012652 (owner: TrainBranchBot) [01:04:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:23:40] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:23:46] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:34:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:38:33] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:38:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:41:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:46:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:51:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:07:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:11:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:17:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:17:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:37:16] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:37] (PS1) RLazarus: mediawiki: Add a comment annotation for mwscript jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1012802 (https://phabricator.wikimedia.org/T341553) [02:40:46] (PS1) RLazarus: deployment_server: Label and annotation improvements for mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1012803 (https://phabricator.wikimedia.org/T341553) [02:42:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:44:49] (CR) CI reject: [V:-1] deployment_server: Label and annotation improvements for mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1012803 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus) [02:47:51] (PS2) RLazarus: deployment_server: Label and annotation improvements for mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1012803 (https://phabricator.wikimedia.org/T341553) [02:53:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:54:21] (PS1) DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1012804 (https://phabricator.wikimedia.org/T219903) [02:58:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:02:16] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:09:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:14:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:15:17] (PS3) Andrew Bogott: puppetserver: add puppet7-facts-export-nodb.py [puppet] - https://gerrit.wikimedia.org/r/1012764 (https://phabricator.wikimedia.org/T351450) [03:15:17] (PS3) Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - https://gerrit.wikimedia.org/r/1012765 [03:20:42] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:20:49] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:21:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:22:06] (CR) Andrew Bogott: [C:+1] "yay tests!" 
[puppet] - https://gerrit.wikimedia.org/r/1009787 (https://phabricator.wikimedia.org/T359192) (owner: FNegri) [03:24:43] (CR) Andrew Bogott: [C:+1] openstack: keystone: ensure keystone-admin is restarted when keystone is [puppet] - https://gerrit.wikimedia.org/r/992676 (owner: Majavah) [03:25:43] (PS4) Andrew Bogott: puppetserver: add puppet7-facts-export-nodb.py [puppet] - https://gerrit.wikimedia.org/r/1012764 (https://phabricator.wikimedia.org/T351450) [03:25:44] (PS4) Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - https://gerrit.wikimedia.org/r/1012765 [03:27:01] (PS5) Andrew Bogott: puppetserver: add puppet7-facts-export-nodb.py [puppet] - https://gerrit.wikimedia.org/r/1012764 (https://phabricator.wikimedia.org/T351450) [03:27:01] (PS5) Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - https://gerrit.wikimedia.org/r/1012765 [03:28:38] (PS2) Andrew Bogott: hieradata: remove non-private nets from private_reverse_zones [puppet] - https://gerrit.wikimedia.org/r/1012351 (owner: Majavah) [03:28:47] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1012351 (owner: Majavah) [03:36:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:40:35] (CR) Andrew Bogott: [C:+1] hieradata: remove non-private nets from private_reverse_zones [puppet] - https://gerrit.wikimedia.org/r/1012351 (owner: Majavah) [03:58:32] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:58:39] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:59:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is 
missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:00:05] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9644719 (Andrew) [04:01:42] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:01:43] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9644721 (Andrew) a:Andrew→Jhancock.wm Sorry for the slow response! I hope I've now included all that you need. [04:01:49] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:04:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:07:13] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:10:10] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:10:17] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:13:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:14:27] !log @deploy2002 
helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:14:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:17:30] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:17:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:26:07] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:26:14] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:38:40] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2043:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:40:02] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:40:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:41:59] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:47:04] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:47:10] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:48:40] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubernetes2043:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:50:44] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:50:51] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:55:10] 
(KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2043:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:19:58] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:20:05] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:30:10] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:35:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:44:24] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:44:31] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:47:45] (PS4) KartikMistry: Enable Content/Section translation on some Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1010226 (https://phabricator.wikimedia.org/T353510) [05:52:04] (CR) KartikMistry: [C:+2] Update cxserver to 2024-03-18-111401-production [deployment-charts] - https://gerrit.wikimedia.org/r/1012364 (https://phabricator.wikimedia.org/T353510) (owner: KartikMistry) [05:52:58] (Merged) jenkins-bot: Update cxserver to 2024-03-18-111401-production [deployment-charts] - 
https://gerrit.wikimedia.org/r/1012364 (https://phabricator.wikimedia.org/T353510) (owner: KartikMistry) [05:54:53] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:55:18] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:56:00] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:56:07] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T0600) [06:02:04] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:02:11] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:07:05] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:07:39] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:08:02] (PS1) Marostegui: control-mariadb-10.6-bookworm: Upgrade version [software] - https://gerrit.wikimedia.org/r/1012914 (https://phabricator.wikimedia.org/T357089) [06:08:05] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:08:40] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:08:49] (CR) Marostegui: [C:+2] control-mariadb-10.6-bookworm: Upgrade version [software] - https://gerrit.wikimedia.org/r/1012914 (https://phabricator.wikimedia.org/T357089) (owner: Marostegui) [06:08:51] (PS1) Marostegui: installserver: Do not reimage es1036 [puppet] - https://gerrit.wikimedia.org/r/1012915 [06:08:55] !log Updated cxserver to 2024-03-18-111401-production (T353510) [06:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:59] T353510: Enable Content and Section translation on some Wikipedias with potential to be supported 
with MinT using MADLAD-400 - https://phabricator.wikimedia.org/T353510 [06:09:20] (Merged) jenkins-bot: control-mariadb-10.6-bookworm: Upgrade version [software] - https://gerrit.wikimedia.org/r/1012914 (https://phabricator.wikimedia.org/T357089) (owner: Marostegui) [06:10:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:13:06] (CR) Marostegui: [C:+2] installserver: Do not reimage es1036 [puppet] - https://gerrit.wikimedia.org/r/1012915 (owner: Marostegui) [06:17:40] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:20:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:23:29] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:23:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:33:25] (CR) AOkoth: [C:+2] miscweb: add security-landing-page values [deployment-charts] - https://gerrit.wikimedia.org/r/1011028 (https://phabricator.wikimedia.org/T350796) (owner: AOkoth) [06:33:38] (PS4) AOkoth: 
miscweb: add security-landing-page values [deployment-charts] - https://gerrit.wikimedia.org/r/1011028 (https://phabricator.wikimedia.org/T350796) [06:35:07] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:35:14] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:35:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:36:03] (CR) AOkoth: [V:+2 C:+2] miscweb: add security-landing-page values [deployment-charts] - https://gerrit.wikimedia.org/r/1011028 (https://phabricator.wikimedia.org/T350796) (owner: AOkoth) [06:38:12] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:38:19] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:39:40] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [06:41:24] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [06:42:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:47:40] !log aokoth@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [06:47:59] !log aokoth@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [06:48:31] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:48:38] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:48:42] !log aokoth@deploy2002 helmfile [codfw] START 
helmfile.d/services/miscweb: apply [06:48:59] !log aokoth@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [06:53:55] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:54:02] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:56:32] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:56:39] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:00:32] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:00:39] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:02:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:03:38] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:03:45] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:07:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:08:49] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:08:55] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:12:07] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:12:14] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:17:40] (KubernetesRsyslogDown) resolved: rsyslog 
on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:21:20] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:21:27] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:22:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:26:24] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:26:30] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:27:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:36:22] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:36:29] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:37:55] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2043:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:40:10] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubernetes2043:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:41:23] !log @deploy2002 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [07:41:29] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:43:39] (PS1) KartikMistry: Update MinT to 2024-03-20-072303-production [deployment-charts] - https://gerrit.wikimedia.org/r/1012988 (https://phabricator.wikimedia.org/T353791) [07:44:12] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:44:19] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:45:11] (PS1) Mabualruz: MW Config - Rename the skin night mode classes to more readable classes [mediawiki-config] - https://gerrit.wikimedia.org/r/1012989 (https://phabricator.wikimedia.org/T359983) [07:52:05] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:52:12] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:53:36] (CR) KartikMistry: [C:+2] Update MinT to 2024-03-20-072303-production [deployment-charts] - https://gerrit.wikimedia.org/r/1012988 (https://phabricator.wikimedia.org/T353791) (owner: KartikMistry) [07:55:05] (Merged) jenkins-bot: Update MinT to 2024-03-20-072303-production [deployment-charts] - https://gerrit.wikimedia.org/r/1012988 (https://phabricator.wikimedia.org/T353791) (owner: KartikMistry) [07:56:24] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [07:57:40] (KubernetesRsyslogDown) firing: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:00:00] ops-codfw, SRE: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9644811 (ayounsi) The counters are for failed packets and not dropped packets due to saturation 
(that's a different counter). So there is something wrong somewhere, and looks like it's not the cable or the NIC... [08:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T0800). [08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:38] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [08:02:03] * kart_ is here. [08:02:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1475:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1475 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:02:42] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1010226 (https://phabricator.wikimedia.org/T353510) (owner: KartikMistry) [08:03:26] (Merged) jenkins-bot: Enable Content/Section translation on some Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1010226 (https://phabricator.wikimedia.org/T353510) (owner: KartikMistry) [08:04:44] !log kartik@deploy2002 Started scap: Backport for [[gerrit:1010226|Enable Content/Section translation on some Wikipedias (T353510)]] [08:04:49] T353510: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT using MADLAD-400 - https://phabricator.wikimedia.org/T353510 [08:07:13] !log kartik@deploy2002 kartik: Backport for [[gerrit:1010226|Enable Content/Section translation on some Wikipedias (T353510)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:09:30] !log kartik@deploy2002 kartik: 
Continuing with sync [08:12:10] (03PS1) 10Muehlenhoff: Add a stub role to build Apereo CAS [puppet] - 10https://gerrit.wikimedia.org/r/1012990 (https://phabricator.wikimedia.org/T357748) [08:12:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:22] (03CR) 10CI reject: [V:04-1] Add a stub role to build Apereo CAS [puppet] - 10https://gerrit.wikimedia.org/r/1012990 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [08:16:15] (03PS1) 10Hashar: Fix commit-message-validator being always successful [puppet] - 10https://gerrit.wikimedia.org/r/1012994 (https://phabricator.wikimedia.org/T360460) [08:16:16] (03CR) 10Muehlenhoff: [C:03+1] peopleweb: set envoy::ssl_provider to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [08:17:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:17:31] (03PS2) 10Muehlenhoff: Add a stub role to build Apereo CAS [puppet] - 10https://gerrit.wikimedia.org/r/1012990 (https://phabricator.wikimedia.org/T357748) [08:17:59] (03PS3) 10Brouberol: Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) [08:18:07] (03CR) 10Brouberol: Add template rendering external services egress NetworkPolicy resources (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [08:19:00] 
(03CR) 10CI reject: [V:04-1] Add a stub role to build Apereo CAS [puppet] - 10https://gerrit.wikimedia.org/r/1012990 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [08:20:31] (03CR) 10Marostegui: [C:03+2] Fix commit-message-validator being always successful [puppet] - 10https://gerrit.wikimedia.org/r/1012994 (https://phabricator.wikimedia.org/T360460) (owner: 10Hashar) [08:21:19] (03PS1) 10Filippo Giunchedi: prometheus: scrape envoy on k8s metrics with 'usedonly' [puppet] - 10https://gerrit.wikimedia.org/r/1012995 (https://phabricator.wikimedia.org/T359633) [08:21:37] 10SRE-swift-storage, 06Commons, 06serviceops: Commons thumbnails are broken for certain large sizes of thumbnail images - https://phabricator.wikimedia.org/T358738#9644839 (10MatthewVernon) Yes, we don't replicate thumbnails between DCs any more (and this has been the case since July 2022 cf. T313102) [08:21:51] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:1010226|Enable Content/Section translation on some Wikipedias (T353510)]] (duration: 17m 06s) [08:21:55] T353510: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT using MADLAD-400 - https://phabricator.wikimedia.org/T353510 [08:23:30] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [08:26:03] (03PS3) 10Muehlenhoff: Add a stub role to build Apereo CAS [puppet] - 10https://gerrit.wikimedia.org/r/1012990 (https://phabricator.wikimedia.org/T357748) [08:29:49] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [08:30:37] (03CR) 10Filippo Giunchedi: "Deployment plan is to stop puppet and enable it on say prometheus2005 first and check the results" [puppet] - 10https://gerrit.wikimedia.org/r/1012995 (https://phabricator.wikimedia.org/T359633) (owner: 10Filippo Giunchedi) [08:37:42] (03CR) 10Slyngshede: "Inline question/nit." 
[puppet] - 10https://gerrit.wikimedia.org/r/1012990 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [08:37:44] (03PS1) 10Muehlenhoff: aptrepo: Remove obsolete migration code [puppet] - 10https://gerrit.wikimedia.org/r/1012996 (https://phabricator.wikimedia.org/T331613) [08:41:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:42:14] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:42:45] (03CR) 10Clément Goubert: [C:03+1] kubernetes: migrate 5 eqiad appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1009309 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [08:46:56] (03CR) 10Muehlenhoff: Add a stub role to build Apereo CAS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012990 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [08:47:58] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [08:51:19] !log installing systemd updates from bookworm point release [08:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:55] (03CR) 10Slyngshede: Add a stub role to build Apereo CAS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012990 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [08:56:01] (03CR) 10Slyngshede: [C:03+1] Add a stub role to build Apereo CAS [puppet] - 10https://gerrit.wikimedia.org/r/1012990 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [08:57:20] !log 
kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [08:57:41] (03CR) 10Ayounsi: [C:03+2] Routed Ganeti: Add v6 static route to VM [puppet] - 10https://gerrit.wikimedia.org/r/995032 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:58:38] !log Depooling mw1368.eqiad.wmnet,mw1369.eqiad.wmnet,mw1370.eqiad.wmnet,mw1478.eqiad.wmnet,mw1479.eqiad.wmnet - T351074 [08:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:43] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [08:59:03] !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [08:59:20] (03CR) 10Ayounsi: [C:03+2] Routed Ganeti: Add v6 static route to VM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995032 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:59:30] !log Update MinT to 2024-03-20-072303-production (T353791, T340956) [08:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:36] T353791: Improve MinT support for rich text - https://phabricator.wikimedia.org/T353791 [08:59:36] T340956: Proof-of-concept for showing a machine translated sections of Wikipedia articles - https://phabricator.wikimedia.org/T340956 [09:00:16] (03CR) 10JMeybohm: [C:03+1] "Sounds plausible. Although we usually only configure clusters that are used by the service using the envoy - so maybe we don't gain much h" [puppet] - 10https://gerrit.wikimedia.org/r/1012995 (https://phabricator.wikimedia.org/T359633) (owner: 10Filippo Giunchedi) [09:00:20] 10SRE-swift-storage, 06Commons, 06serviceops: Commons thumbnails are broken for certain large sizes of thumbnail images - https://phabricator.wikimedia.org/T358738#9644903 (10akosiaris) >>! In T358738#9644283, @tstarling wrote: > I thought there was no cross-DC replication of thumbnails. 
T299125#8221206 seem... [09:01:50] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations: Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412#9644912 (10Gehel) p:05Triage→03Medium [09:02:17] 06SRE, 06Data-Platform-SRE: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9644910 (10Gehel) p:05Triage→03Medium [09:02:25] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [09:02:34] (03CR) 10Clément Goubert: [C:03+2] kubernetes: migrate 5 eqiad appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1009309 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [09:03:46] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.03.04 - 2024.03.24): Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9644925 (10Gehel) p:05Triage→03High [09:04:02] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.03.04 - 2024.03.24): Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9644927 (10Gehel) [09:06:22] (03CR) 10Muehlenhoff: [C:03+2] Add a stub role to build Apereo CAS [puppet] - 10https://gerrit.wikimedia.org/r/1012990 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [09:06:23] 06SRE, 06Infrastructure-Foundations, 10netops, 07Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177#9644950 (10Gehel) [09:11:45] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:12:02] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host 
mw1368.eqiad.wmnet with OS bullseye [09:12:25] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1369.eqiad.wmnet with OS bullseye [09:12:53] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1370.eqiad.wmnet with OS bullseye [09:13:16] (03PS1) 10Slyngshede: IDP: Switchback to Bullseye host. [dns] - 10https://gerrit.wikimedia.org/r/1013001 (https://phabricator.wikimedia.org/T357748) [09:13:21] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1478.eqiad.wmnet with OS bullseye [09:13:47] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1479.eqiad.wmnet with OS bullseye [09:16:16] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1013001 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [09:18:12] (03PS2) 10Majavah: openstack: keystone: ensure keystone-admin is restarted when keystone is [puppet] - 10https://gerrit.wikimedia.org/r/992676 [09:18:28] (03CR) 10Majavah: [C:03+2] hieradata: remove non-private nets from private_reverse_zones [puppet] - 10https://gerrit.wikimedia.org/r/1012351 (owner: 10Majavah) [09:19:02] (03PS19) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [09:19:15] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [09:19:17] (03PS2) 10Majavah: Remove labtesttoolsadmin [dns] - 10https://gerrit.wikimedia.org/r/1010884 [09:19:35] (03CR) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [09:19:37] !log rolling-restart memcached on swift-fe-eqiad 
[09:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:42] (03CR) 10Majavah: [C:03+2] openstack: keystone: ensure keystone-admin is restarted when keystone is [puppet] - 10https://gerrit.wikimedia.org/r/992676 (owner: 10Majavah) [09:20:21] (03PS5) 10Brouberol: global_config: add presto/druid/IDP node IPs to the k8s global config [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [09:20:55] (03PS21) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [09:20:59] (03CR) 10Majavah: [C:03+2] Remove labtesttoolsadmin [dns] - 10https://gerrit.wikimedia.org/r/1010884 (owner: 10Majavah) [09:21:47] (03CR) 10CI reject: [V:04-1] external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [09:22:24] Appservers alert above known [09:22:38] (03PS2) 10Ayounsi: Routed Ganeti: use per tap interface dhcrelay [puppet] - 10https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) [09:23:12] (03PS6) 10Brouberol: global_config: rework external services data structure [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [09:23:50] (03CR) 10CI reject: [V:04-1] Routed Ganeti: use per tap interface dhcrelay [puppet] - 10https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:24:12] (03CR) 10Ayounsi: Routed Ganeti: use per tap interface dhcrelay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:24:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - 
https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [09:24:34] (03PS1) 10Muehlenhoff: Configure idp-test as build host [puppet] - 10https://gerrit.wikimedia.org/r/1013003 (https://phabricator.wikimedia.org/T357748) [09:25:09] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1013003 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [09:25:33] (03CR) 10David Caro: [C:03+1] dynamicproxy: use http 1.1 for backend connections [puppet] - 10https://gerrit.wikimedia.org/r/1012728 (https://phabricator.wikimedia.org/T354116) (owner: 10Majavah) [09:25:45] (03PS2) 10Slyngshede: IDP: Switchback to Bullseye host. [dns] - 10https://gerrit.wikimedia.org/r/1013001 (https://phabricator.wikimedia.org/T357748) [09:26:05] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1368.eqiad.wmnet with reason: host reimage [09:26:17] (03PS1) 10Jelto: gitlab: remove duplicate key type from gitlab known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1013004 (https://phabricator.wikimedia.org/T337107) [09:26:19] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1478.eqiad.wmnet with reason: host reimage [09:26:20] (03PS3) 10Ayounsi: Routed Ganeti: use per tap interface dhcrelay [puppet] - 10https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) [09:26:21] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1369.eqiad.wmnet with reason: host reimage [09:26:55] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1479.eqiad.wmnet with reason: host reimage [09:27:09] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1370.eqiad.wmnet with reason: host reimage [09:27:56] (03CR) 10Slyngshede: [C:03+2] IDP: 
Switchback to Bullseye host. [dns] - 10https://gerrit.wikimedia.org/r/1013001 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [09:28:31] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1368.eqiad.wmnet with reason: host reimage [09:30:56] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1370.eqiad.wmnet with reason: host reimage [09:31:01] (03CR) 10Ayounsi: Add support for routed Ganeti in D-I early_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:31:27] (03PS6) 10Ayounsi: Add support for routed Ganeti in D-I early_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152) [09:32:07] claime: could I use one of your re-image to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003416 ? [09:32:30] basically make sure it didn't break the current workflow [09:32:48] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1479.eqiad.wmnet with reason: host reimage [09:33:19] (03CR) 10Muehlenhoff: [C:03+2] Configure idp-test as build host [puppet] - 10https://gerrit.wikimedia.org/r/1013003 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [09:33:48] I think they're all through that part unfortunately [09:34:33] I can set one aside to re-reimage, I'll just need it back by 1400UTC :p [09:34:36] XioNoX: ^ [09:35:01] it's ok, I'll use one of the sretest [09:35:01] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1478.eqiad.wmnet with reason: host reimage [09:35:16] ^ more coming? 
: [09:35:17] :) [09:35:42] XioNoX: I sent 5 at about 1 minute interval [09:35:48] (03CR) 10Ayounsi: [C:03+2] Add support for routed Ganeti in D-I early_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:36:18] That's the pre-first-puppet-run downtime :p [09:37:08] I see, thx! [09:37:13] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1369.eqiad.wmnet with reason: host reimage [09:38:34] (03PS22) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [09:38:57] (03CR) 10Jelto: [V:03+1 C:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1664/console" [puppet] - 10https://gerrit.wikimedia.org/r/1007610 (owner: 10Majavah) [09:39:49] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test1003.wikimedia.org with OS bullseye [09:40:52] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1013004 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [09:41:34] 06SRE, 06Traffic: Migrate purged away from cergen-issued certificate - https://phabricator.wikimedia.org/T360506 (10MoritzMuehlenhoff) 03NEW [09:43:22] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [09:44:41] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9645055 (10MoritzMuehlenhoff) [09:48:19] (03CR) 10Muehlenhoff: [C:03+2] aptrepo: Remove obsolete migration code [puppet] - 10https://gerrit.wikimedia.org/r/1012996 
(https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [09:48:21] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1368.eqiad.wmnet with OS bullseye [09:50:51] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1370.eqiad.wmnet with OS bullseye [09:51:34] (03PS1) 10Effie Mouzeli: wmnet: Update DNS records for master dbs to eqiad (switchover #2) [dns] - 10https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) [09:52:01] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1479.eqiad.wmnet with OS bullseye [09:52:08] (03PS2) 10Effie Mouzeli: wmnet: Update DNS records for master dbs to eqiad (switchover #2) [dns] - 10https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) [09:52:09] cool, test successful ! [09:52:25] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test1003.wikimedia.org with reason: host reimage [09:53:32] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1478.eqiad.wmnet with OS bullseye [09:54:02] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 200132 [09:54:24] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 200132 [09:54:40] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test1003.wikimedia.org with reason: host reimage [09:54:48] (03PS3) 10Effie Mouzeli: DBs: Update DNS records for master DBs to eqiad (switchover #2) [dns] - 10https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) [09:56:47] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1369.eqiad.wmnet with OS bullseye [09:57:22] !log running homer 'cr*eqiad*' commit 'T351074' [09:57:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:27] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [09:57:53] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:59:15] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [10:00:29] (03PS1) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) [10:01:38] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [10:02:53] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:03:00] (03PS1) 10Fabfur: benthos/haproxy: fix hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) [10:03:04] !log STOP persistRevisionThreadItems on viwiki for T315510, will restart after DC switch is done (resume at: --start '["17099868"]') [10:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:21] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [10:04:51] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - 
https://phabricator.wikimedia.org/T357133#9645129 (10MoritzMuehlenhoff) [10:07:17] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test1003.wikimedia.org with OS bullseye [10:08:04] (03CR) 10Stevemunene: [C:03+1] ATS: redirect superset.wikimedia.org to the kubernetes deployment [puppet] - 10https://gerrit.wikimedia.org/r/1011359 (https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [10:08:16] !log revoke labweb.discovery.wmnet cergen cert, migrated to cfssl [10:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:23] (03CR) 10Stevemunene: [C:03+1] idp: update the superset OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1011360 (https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [10:09:03] (03CR) 10Stevemunene: [C:03+1] superset: cleanup references to old temporary domains [puppet] - 10https://gerrit.wikimedia.org/r/1011361 (https://phabricator.wikimedia.org/T358480) (owner: 10Brouberol) [10:10:54] (03PS1) 10Majavah: Remove labweb cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1013009 [10:11:03] !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1368.eqiad.wmnet|mw1369.eqiad.wmnet|mw1370.eqiad.wmnet|mw1478.eqiad.wmnet|mw1479.eqiad.wmnet),cluster=kubernetes,service=kubesvc [10:14:40] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [10:15:32] (03CR) 10Brouberol: [C:03+2] idp: update the superset OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1011360 (https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [10:15:41] (03CR) 10Brouberol: [C:03+2] ATS: redirect superset.wikimedia.org to the kubernetes deployment [puppet] - 10https://gerrit.wikimedia.org/r/1011359 
(https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [10:16:18] !log roll-restarting changeprop in eqiad [10:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:21] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [10:17:02] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [10:17:06] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2003.codfw.wmnet with OS bookworm [10:17:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1013009 (owner: 10Majavah) [10:17:49] (03PS1) 10Ladsgroup: Set three more wikis to read new in pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013010 (https://phabricator.wikimedia.org/T351237) [10:19:32] (03CR) 10Marostegui: [C:04-1] "Missing:" [dns] - 10https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [10:20:47] jouncebot: nowandnext [10:20:47] No deployments scheduled for the next 0 hour(s) and 39 minute(s) [10:20:47] In 0 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T1100) [10:21:05] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9645191 (10MoritzMuehlenhoff) [10:22:25] !log migrating superset to Kubernetes. 
Some CAS errors are expected during ~15 minutes [10:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:29] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for idm-test1001.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [10:22:31] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for idm-test1001.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [10:25:16] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for idm-test1001.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [10:26:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for idm-test1001.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [10:26:37] (03PS1) 10Clément Goubert: Revert "Add File:Claus_-_Conkle to blacklist" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012771 [10:27:30] (03CR) 10Alexandros Kosiaris: [C:03+1] Revert "Add File:Claus_-_Conkle to blacklist" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012771 (owner: 10Clément Goubert) [10:28:44] (03CR) 10Alexandros Kosiaris: [C:04-1] "a version pin with version: 0.13.x in helmfile.yaml would do this better" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012771 (owner: 10Clément Goubert) [10:31:13] (03PS1) 10Muehlenhoff: sre.puppet.renew-cert: Extend help text for --installer [cookbooks] - 10https://gerrit.wikimedia.org/r/1013012 [10:31:36] !log rolling back changeprop to previous version [10:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:40] (03CR) 10Majavah: [C:03+2] Remove labweb cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1013009 (owner: 10Majavah) [10:34:17] (03PS1) 10Muehlenhoff: Remove labweb.discovery.wmnet dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1013014 [10:36:23] (03PS9) 10Gmodena: Add webrequest.frontend.rc0 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 
(https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [10:36:49] (03PS1) 10Majavah: P:toolforge::k8s::etcd: load checker hosts from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/1013015 (https://phabricator.wikimedia.org/T360514) [10:41:31] (03PS1) 10Majavah: O:toolforge::checker: remove apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/1013016 [10:41:59] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:42:10] (03PS2) 10Majavah: O:toolforge::checker: remove apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/1013016 (https://phabricator.wikimedia.org/T360514) [10:43:44] (03CR) 10Majavah: [C:03+2] O:toolforge::checker: remove apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/1013016 (https://phabricator.wikimedia.org/T360514) (owner: 10Majavah) [10:48:10] (03PS1) 10Alexandros Kosiaris: changeprop: Add IPv6 network policies for rdb101{1,2} [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013018 [10:49:28] (03CR) 10Giuseppe Lavagetto: [C:03+1] changeprop: Add IPv6 network policies for rdb101{1,2} [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013018 (owner: 10Alexandros Kosiaris) [10:49:47] (03CR) 10Majavah: [C:03+1] "Thanks! Forgot that this existed too." 
[labs/private] - 10https://gerrit.wikimedia.org/r/1013014 (owner: 10Muehlenhoff) [10:49:48] (03CR) 10Alexandros Kosiaris: [C:03+2] changeprop: Add IPv6 network policies for rdb101{1,2} [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013018 (owner: 10Alexandros Kosiaris) [10:49:52] (03CR) 10JMeybohm: [C:03+1] changeprop: Add IPv6 network policies for rdb101{1,2} [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013018 (owner: 10Alexandros Kosiaris) [10:50:11] !log superset.wikimedia.org is now migrated to the DSE k8s cluster, CAS errors have receded [10:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:59] (03Merged) 10jenkins-bot: changeprop: Add IPv6 network policies for rdb101{1,2} [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013018 (owner: 10Alexandros Kosiaris) [10:51:31] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:52:51] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove labweb.discovery.wmnet dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1013014 (owner: 10Muehlenhoff) [10:54:11] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T1100) [11:02:28] (03PS1) 10Muehlenhoff: Remove puppetmaster::backend role from puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/1013020 (https://phabricator.wikimedia.org/T357093) [11:02:40] (03PS1) 10Clément Goubert: changeprop: Raise memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013021 [11:04:16] (03PS1) 10Ladsgroup: mediawiki: Get rid of purge flaggedrevs [puppet] - 10https://gerrit.wikimedia.org/r/1013022 (https://phabricator.wikimedia.org/T359529) [11:06:35] (03CR) 10Alexandros Kosiaris: [C:03+1] changeprop: Raise memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013021 (owner: 10Clément 
Goubert) [11:07:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013020 (https://phabricator.wikimedia.org/T357093) (owner: 10Muehlenhoff) [11:07:11] (03CR) 10Clément Goubert: [C:03+2] changeprop: Raise memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013021 (owner: 10Clément Goubert) [11:09:23] (03PS2) 10Fabfur: benthos/haproxy: fix hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) [11:09:48] (03Merged) 10jenkins-bot: changeprop: Raise memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013021 (owner: 10Clément Goubert) [11:09:57] (03CR) 10JMeybohm: [C:03+1] changeprop: Raise memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013021 (owner: 10Clément Goubert) [11:10:20] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:10:26] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:10:38] (03CR) 10CI reject: [V:04-1] benthos/haproxy: fix hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [11:10:56] !log bounce apache2 on logstash1031 - T337818 [11:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:01] T337818: apache2 cpu-stuck on logstash1032 causes kafka logging lag - https://phabricator.wikimedia.org/T337818 [11:12:42] (03PS1) 10Clément Goubert: admin_ng: fix missing limitranges stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013024 [11:13:44] (03PS4) 10Kamila Součková: shellbox: add PHP + Apache timeout settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) [11:14:56] (03CR) 10CI reject: [V:04-1] admin_ng: fix missing limitranges stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013024 (owner: 10Clément Goubert) [11:15:11] (03PS2) 10Clément Goubert: admin_ng: 
fix missing limitranges stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013024 [11:17:06] (03PS3) 10Fabfur: benthos/haproxy: fix hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) [11:18:30] (03CR) 10JMeybohm: [C:03+1] admin_ng: fix missing limitranges stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013024 (owner: 10Clément Goubert) [11:18:40] (03CR) 10Clément Goubert: [C:03+2] admin_ng: fix missing limitranges stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013024 (owner: 10Clément Goubert) [11:19:34] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: Access to rua-dmarc@wikimedia.org - https://phabricator.wikimedia.org/T360462#9645333 (10Fabfur) a:03Fabfur [11:21:42] (03Merged) 10jenkins-bot: admin_ng: fix missing limitranges stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013024 (owner: 10Clément Goubert) [11:22:18] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:22:49] !log deploying new namespace limits for changeprop [11:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:55] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:23:03] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:24:17] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[11:25:16] stashbot :'( [11:25:28] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [11:25:53] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [11:31:14] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [11:33:47] (03PS4) 10Fabfur: benthos/haproxy: fix hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) [11:38:12] (03PS5) 10Fabfur: benthos/haproxy: fix hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) [11:43:48] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [11:46:08] (03PS1) 10Clément Goubert: changeprop: Revert version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013029 [11:47:07] (03CR) 10Giuseppe Lavagetto: [C:03+1] changeprop: Revert version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013029 (owner: 10Clément Goubert) [11:47:24] (03CR) 10Clément Goubert: [C:03+2] changeprop: Revert version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013029 (owner: 10Clément Goubert) [11:48:07] (03Merged) 10jenkins-bot: changeprop: Revert version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013029 (owner: 10Clément Goubert) [11:50:41] (ConfdResourceFailed) firing: (3) confd resource _var_lib_gdnsd_discovery-swift-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - 
https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:51:30] (03PS1) 10Slyngshede: Allow users to associate key with a system on upload [software/bitu] - 10https://gerrit.wikimedia.org/r/1013031 (https://phabricator.wikimedia.org/T359543) [11:53:29] (03PS3) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [11:54:02] (03PS1) 10Giuseppe Lavagetto: thumbor: fix swift url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013033 [11:54:25] (03PS1) 10Fabfur: benthos: ensure sequence field is an INT [puppet] - 10https://gerrit.wikimedia.org/r/1013034 (https://phabricator.wikimedia.org/T358109) [11:54:46] (03CR) 10CI reject: [V:04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [11:55:01] (03CR) 10Effie Mouzeli: [C:03+1] thumbor: fix swift url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013033 (owner: 10Giuseppe Lavagetto) [11:55:20] (03CR) 10Effie Mouzeli: [C:03+2] thumbor: fix swift url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013033 (owner: 10Giuseppe Lavagetto) [11:55:41] (ConfdResourceFailed) firing: (14) confd resource _var_lib_gdnsd_discovery-swift-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:56:14] (03Merged) 10jenkins-bot: thumbor: fix swift url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013033 (owner: 10Giuseppe Lavagetto) [11:57:15] (03CR) 10MVernon: [C:03+1] "Late to the party, but this looks good to me, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1013033 (owner: 10Giuseppe Lavagetto) [11:57:41] (03PS1) 10Clément Goubert: changeprop: Exclude File:Eaddy - English [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013035 (https://phabricator.wikimedia.org/T353876) [11:59:54] (03PS4) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [12:01:24] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: 14Access to rua-dmarc@wikimedia.org - 14https://phabricator.wikimedia.org/T360462#9645440 (10Fabfur) 05Open→03Resolved 14Sent information about dmarc address privately (mail) to the ticket author [12:01:27] (03CR) 10CI reject: [V:04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [12:02:19] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::backend role from puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/1013020 (https://phabricator.wikimedia.org/T357093) (owner: 10Muehlenhoff) [12:02:54] (03PS5) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [12:03:08] 10ops-codfw, 06SRE: 14Inbound interface errors - 14https://phabricator.wikimedia.org/T358417#9645444 (10jcrespo) 14Thanks, that's actually useful context I didn't know (the alert description was not very understandable to me). I will keep an eye on the network performance and share my findings. 
[12:04:11] (03CR) 10CI reject: [V:04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [12:05:12] (03PS2) 10Clément Goubert: changeprop: Exclude commons files with 100+ pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013035 (https://phabricator.wikimedia.org/T353876) [12:07:26] (ProbeDown) firing: (2) Service puppetmaster1002:8141 has failed probes (http_puppetmaster1002_eqiad_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:17] ^ alert lag, this host has dropped the puppetmaster role earlier [12:11:08] (03PS2) 10Slyngshede: Allow users to associate key with a system on upload [software/bitu] - 10https://gerrit.wikimedia.org/r/1013031 (https://phabricator.wikimedia.org/T359543) [12:12:52] (03PS3) 10Clément Goubert: changeprop: Exclude commons files with 100+ pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013035 (https://phabricator.wikimedia.org/T353876) [12:15:44] (03CR) 10Gmodena: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1013034 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [12:16:04] (03PS1) 10Muehlenhoff: Point codfw urldownloaders to 2003 [dns] - 10https://gerrit.wikimedia.org/r/1013037 [12:17:40] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:19] ^ decom in progress for apt2001, fixing [12:20:30] (03CR) 10Fabfur: [C:03+2] benthos: ensure sequence field is an INT [puppet] - 10https://gerrit.wikimedia.org/r/1013034 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [12:21:38] (03CR) 10Effie Mouzeli: [C:03+1] changeprop: Exclude commons files with 100+ pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013035 (https://phabricator.wikimedia.org/T353876) (owner: 10Clément Goubert) [12:22:26] (ProbeDown) resolved: (2) Service puppetmaster1002:8141 has failed probes (http_puppetmaster1002_eqiad_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:22:53] (03PS6) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [12:24:10] (03CR) 10CI reject: [V:04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [12:24:45] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::etcd: load checker hosts from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/1013015 (https://phabricator.wikimedia.org/T360514) (owner: 10Majavah) [12:25:32] (03CR) 
10Muehlenhoff: [C:03+2] Point codfw urldownloaders to 2003 [dns] - 10https://gerrit.wikimedia.org/r/1013037 (owner: 10Muehlenhoff) [12:27:25] (SystemdUnitFailed) firing: (3) rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:27:32] (03CR) 10Brouberol: [C:03+2] superset: cleanup references to old temporary domains [puppet] - 10https://gerrit.wikimedia.org/r/1011361 (https://phabricator.wikimedia.org/T358480) (owner: 10Brouberol) [12:27:42] (03PS2) 10Brouberol: superset: cleanup references to old temporary domains [puppet] - 10https://gerrit.wikimedia.org/r/1011361 (https://phabricator.wikimedia.org/T358480) [12:35:38] (03CR) 10Brouberol: [V:03+2 C:03+2] superset: cleanup references to old temporary domains [puppet] - 10https://gerrit.wikimedia.org/r/1011361 (https://phabricator.wikimedia.org/T358480) (owner: 10Brouberol) [12:35:45] (03PS1) 10Majavah: O:puppetserver: enable openstack stale cert exporter [puppet] - 10https://gerrit.wikimedia.org/r/1013039 [12:36:10] (03PS1) 10Fabfur: benthos: moved batching as close to the input as possible [puppet] - 10https://gerrit.wikimedia.org/r/1013040 (https://phabricator.wikimedia.org/T360454) [12:37:23] (03CR) 10Majavah: [C:03+1] puppetserver: add puppet7-facts-export-nodb.py [puppet] - 10https://gerrit.wikimedia.org/r/1012764 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [12:37:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1013031 (https://phabricator.wikimedia.org/T359543) (owner: 10Slyngshede) [12:37:58] (03CR) 10Slyngshede: [C:03+2] Allow users to associate key with a system on upload [software/bitu] - 10https://gerrit.wikimedia.org/r/1013031 (https://phabricator.wikimedia.org/T359543) (owner: 10Slyngshede) [12:39:21] (03Merged) 10jenkins-bot: 
Allow users to associate key with a system on upload [software/bitu] - 10https://gerrit.wikimedia.org/r/1013031 (https://phabricator.wikimedia.org/T359543) (owner: 10Slyngshede) [12:39:37] (03CR) 10Gehel: [C:03+1] superset: cleanup references to old temporary domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1011362 (https://phabricator.wikimedia.org/T358480) (owner: 10Brouberol) [12:39:53] (03CR) 10Gehel: [C:03+1] superset: cleanup references to old temporary domains [dns] - 10https://gerrit.wikimedia.org/r/1011363 (https://phabricator.wikimedia.org/T358480) (owner: 10Brouberol) [12:43:35] !log Depooled swift-rw from codfw [12:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:56] (03CR) 10Brouberol: [C:03+2] superset: cleanup references to old temporary domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1011362 (https://phabricator.wikimedia.org/T358480) (owner: 10Brouberol) [12:48:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [12:48:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [12:50:04] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:51:18] (03PS4) 10Effie Mouzeli: DBs: Update DNS records for master DBs to eqiad (switchover #2) [dns] - 10https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) [12:52:06] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:57:45] !log installing tiff security updates [12:57:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [12:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [12:58:35] 10SRE-Access-Requests, 06collaboration-services, 
06Gerrit-Privilege-Requests: Add dani to wmf-deployment - https://phabricator.wikimedia.org/T360521 (10Jelto) 03NEW [12:59:18] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test2002.wikimedia.org with OS bookworm [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:01:13] (03CR) 10Andrew Bogott: [C:03+2] puppetserver: add puppet7-facts-export-nodb.py [puppet] - 10https://gerrit.wikimedia.org/r/1012764 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [13:01:42] yup, nothing to deploy [13:01:49] (and DC switchover in one hour ^^) [13:02:21] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:03:21] (03CR) 10Jelto: [C:03+2] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012804 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [13:04:29] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012804 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [13:06:00] (03PS5) 10Muehlenhoff: netbox: Set deployment method to avoid creating scap target [puppet] - 10https://gerrit.wikimedia.org/r/1002392 [13:06:19] (03PS1) 10Slyngshede: idp-test - Switch to upgraded Bookworm host. 
[dns] - 10https://gerrit.wikimedia.org/r/1013043 (https://phabricator.wikimedia.org/T357748) [13:06:50] (03PS1) 10JMeybohm: changeprop-jobqueue: Add IPv6 network policies for rdb101{1,2} [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013044 [13:08:12] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1013043 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [13:08:24] !log installing imagemagick security updates [13:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:47] (03CR) 10Slyngshede: [C:03+2] idp-test - Switch to upgraded Bookworm host. [dns] - 10https://gerrit.wikimedia.org/r/1013043 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [13:09:19] (03PS2) 10Brouberol: superset: cleanup references to old temporary domains [dns] - 10https://gerrit.wikimedia.org/r/1011363 (https://phabricator.wikimedia.org/T358480) [13:10:36] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:10:43] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:11:43] (03CR) 10Brouberol: [C:03+2] superset: cleanup references to old temporary domains [dns] - 10https://gerrit.wikimedia.org/r/1011363 (https://phabricator.wikimedia.org/T358480) (owner: 10Brouberol) [13:16:07] !og installing libuv1 security updates on bullseye [13:17:22] (03PS5) 10Effie Mouzeli: DBs: Update DNS records for master DBs to eqiad (switchover #2) [dns] - 10https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) [13:20:27] (03PS1) 10Slyngshede: P:idp-test remove build host from list of valid IDP hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1013046 (https://phabricator.wikimedia.org/T357748) [13:20:47] !log manually scaled up changeprop replicas in eqiad from 12 to 15 [13:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:52] (03PS1) 10KartikMistry: Update cxserver to 2024-03-20-072017-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013047 (https://phabricator.wikimedia.org/T352739) [13:21:29] (03CR) 10Ladsgroup: [C:03+1] DBs: Update DNS records for master DBs to eqiad (switchover #2) [dns] - 10https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [13:25:01] (03PS1) 10Brouberol: Superset: remove all resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1013048 (https://phabricator.wikimedia.org/T358570) [13:25:29] (03PS1) 10Jelto: gitlab: temporary allow dockerfile frontend on Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/1013049 (https://phabricator.wikimedia.org/T357612) [13:30:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 29.19% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:31:51] (03PS2) 10Brouberol: Superset: remove all resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1013048 (https://phabricator.wikimedia.org/T358570) [13:32:09] !log 13:16 UTC: installing libuv1 security updates on bullseye [re-log, original message wasn’t logged] [13:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:50] (03CR) 10Slyngshede: [C:03+2] P:idp-test remove build host from list of valid IDP hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1013046 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [13:35:11] 06SRE, 10SRE-Access-Requests, 06collaboration-services, 06Gerrit-Privilege-Requests: 14Add dani to wmf-deployment - 14https://phabricator.wikimedia.org/T360521#9645626 (10taavi) 05Open→03Resolved a:03taavi 14Done. [13:35:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 27.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:35:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1013046 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [13:36:48] (03CR) 10Brouberol: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1009479 (owner: 10Muehlenhoff) [13:37:33] (03CR) 10Marostegui: [C:03+1] DBs: Update DNS records for master DBs to eqiad (switchover #2) [dns] - 10https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [13:38:48] (03CR) 10Alexandros Kosiaris: [C:03+2] changeprop-jobqueue: Add IPv6 network policies for rdb101{1,2} [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013044 (owner: 10JMeybohm) [13:40:02] (03Merged) 10jenkins-bot: changeprop-jobqueue: Add IPv6 network policies for rdb101{1,2} [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013044 (owner: 10JMeybohm) [13:41:22] dear deployers, we will be locking scap shortly [13:41:38] 10ops-eqiad, 06SRE, 10procurement: 14install (2) 1.92TB SSDs from decom into prometheus100[56] - 14https://phabricator.wikimedia.org/T359632#9645646 (10Jclark-ctr) 05Open→03Resolved 14drives installed  [13:41:55] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply 
[13:41:58] (03PS4) 10Klausman: admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) [13:42:27] !log update changeprop-jobqueue to include rdb101{1,2} IPv6 related netpols [13:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:40] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:42:53] (03CR) 10Klausman: "I had originally considered doing this and the rest separately (since parts of it are namespace/cluster-specific at first), but I've now r" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:42:57] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [13:43:46] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:44:07] (03PS23) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [13:45:30] (03CR) 10CI reject: [V:04-1] admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:47:12] (03PS6) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [13:47:16] (03PS1) 10Andrew Bogott: Restore puppet-facts-export-nodb.sh [puppet] - 10https://gerrit.wikimedia.org/r/1013051 [13:47:24] (03CR) 10Elukey: profile::prometheus::k8s: move istio metrics to a separate job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:48:50] !log jiji@deploy2002 Locking from deployment [ALL 
REPOSITORIES]: Datacenter Switchover - T357547 [13:48:52] (03CR) 10Majavah: [C:03+1] Restore puppet-facts-export-nodb.sh [puppet] - 10https://gerrit.wikimedia.org/r/1013051 (owner: 10Andrew Bogott) [13:48:59] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 [13:50:27] (03CR) 10Scott French: [C:03+1] DBs: Update DNS records for master DBs to eqiad (switchover #2) [dns] - 10https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [13:55:19] (03CR) 10Andrew Bogott: [C:03+2] Restore puppet-facts-export-nodb.sh [puppet] - 10https://gerrit.wikimedia.org/r/1013051 (owner: 10Andrew Bogott) [13:55:51] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [13:55:53] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [13:56:36] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks [13:56:46] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) [13:57:06] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [13:57:13] (03CR) 10Elukey: profile::prometheus::k8s: move istio metrics to a separate job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:57:38] (03PS2) 10Brouberol: admin-ng: Define external services namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013007 (https://phabricator.wikimedia.org/T360508) [13:58:32] (03PS5) 10Klausman: admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) [13:58:34] switchover coordination will be in #wikimedia-sre, please join there if you have anything related [14:00:05] Deploy window Northward Switchover: MediaWiki 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T1400) [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T1400) [14:00:29] getting ready, waiting for TTLs to coalesce [14:01:23] (03CR) 10CI reject: [V:04-1] admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [14:02:47] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [14:03:32] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [14:03:46] !log jiji@cumin1002 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99) [14:08:25] (SystemdUnitFailed) firing: (6) mediawiki_job_growthexperiments-listTaskCounts.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:30] 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9645712 (10AndrewTavis_WMDE) [14:10:39] 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9645716 (10karapayneWMDE) Notes for Wikidata Dev Team: task needs like 30 mins of sync between a DOT team member (likely @Lucas_Werkmeister_WMDE ) and... 
[14:13:25] (SystemdUnitFailed) resolved: (6) mediawiki_job_growthexperiments-listTaskCounts.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:31] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [14:13:58] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [14:14:33] (03PS7) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [14:14:33] (03PS1) 10Andrew Bogott: codfw1dev: override hiera lookup hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/1013056 [14:15:24] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [14:15:24] !log jiji@cumin1002 MediaWiki read-only period starts at: 2024-03-20 14:15:24.401121 [14:15:38] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [14:15:51] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [14:15:52] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [14:16:01] ^expected [14:16:06] (03CR) 10Bking: [C:04-1] "Per dcausse concerns above, plus the likelihood that there will be more breaking changes to the opensearch client library, we think it wou" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [14:16:11] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [14:16:12] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [14:16:44] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [14:16:45] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[14:16:56] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [14:16:57] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [14:17:41] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [14:17:42] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [14:18:04] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [14:18:05] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [14:18:06] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [14:18:07] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [14:18:19] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:18:20] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [14:18:32] !log jiji@cumin1002 MediaWiki read-only period ends at: 2024-03-20 14:18:32.727570 [14:18:33] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [14:18:44] !log Test write T357547 [14:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:48] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 [14:19:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:19:25] <_joe_> expected [14:19:32] <_joe_> we need to get to the redeployment of jobrunner [14:20:20] effie had a networking issue [14:20:27] <_joe_> oh dear [14:20:27] but apparently it happened right after step 7 
[14:20:37] <_joe_> well we need to do the remaining ones
[14:20:39] I'll take over running step 8
[14:20:44] <_joe_> yes please
[14:21:21] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner
[14:21:25] !log root@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[14:21:25] !log root@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[14:21:42] !log root@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[14:23:01] !log root@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[14:23:02] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=0)
[14:24:06] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance
[14:24:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:24:22] <_joe_> peachy
[14:26:33] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0)
[14:27:08] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl
[14:27:40] (KubernetesRsyslogDown) firing: rsyslog on mw2425:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2425 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:27:40] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0)
[14:29:08] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters
[14:31:29] (PS2) Andrew Bogott: codfw1dev: override hiera lookup hierarchy [puppet] - https://gerrit.wikimedia.org/r/1013056
[14:31:29] (PS8) Andrew Bogott: Remove
profile::puppetserver::enable_ca from hiera settings [puppet] - https://gerrit.wikimedia.org/r/1012765
[14:32:04] 2m 41s of read-only per cookbooks
[14:32:40] (KubernetesRsyslogDown) firing: (16) rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:33:01] <_joe_> akosiaris: the log lines say 3m 12 secs
[14:33:47] Hmm, I think k8s server rsyslogs didn't like that
[14:33:54] I'll get on restarting them
[14:35:21] (PS1) Filippo Giunchedi: prometheus: update partman recipe [puppet] - https://gerrit.wikimedia.org/r/1013060 (https://phabricator.wikimedia.org/T359632)
[14:35:48] (PS4) Ssingh: cookbooks.sre.dns: add roll-reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054)
[14:36:13] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0)
[14:36:50] (CR) Effie Mouzeli: "done" [dns] - https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[14:36:53] (CR) Effie Mouzeli: [C:+2] DBs: Update DNS records for master DBs to eqiad (switchover #2) [dns] - https://gerrit.wikimedia.org/r/1013005 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[14:37:01] (CR) Herron: [C:+1] "more drives!"
[puppet] - https://gerrit.wikimedia.org/r/1013060 (https://phabricator.wikimedia.org/T359632) (owner: Filippo Giunchedi)
[14:37:17] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:19] claime: it's really odd...we had some alerts like that over the last days and I could not figure out why
[14:37:28] mmkubernetes and rsyslog seemed happy all the time
[14:37:40] (KubernetesRsyslogDown) firing: (22) rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:37:53] My wild guess is that rsyslog silently crashes when there's too many messages and it needs to start a new worker
[14:38:24] lovely
[14:38:39] yes
[14:38:52] haven't had time to test out that theory by telling it to start more workers outright
[14:39:44] that theory could explain the sometimes missing logs in logstash as well
[14:40:26] (PS1) Brouberol: AQS1.0: disable aqs service [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522)
[14:40:33] which I've not been able to capture any evidence for with the additional omkafka metrics from rsyslog
[14:41:06] rsyslog clean
[14:41:34] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[14:42:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:42:40] (KubernetesRsyslogDown) resolved: (22) rsyslog on kubernetes2010:9105 is
missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:44:17] (PS2) Brouberol: AQS1.0: disable aqs service [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522)
[14:44:37] (PS1) Effie Mouzeli: maintenance: Update DNS records for maintenance host (switchover #3) [dns] - https://gerrit.wikimedia.org/r/1013064 (https://phabricator.wikimedia.org/T357547)
[14:44:45] (CR) Filippo Giunchedi: [C:+2] prometheus: update partman recipe [puppet] - https://gerrit.wikimedia.org/r/1013060 (https://phabricator.wikimedia.org/T359632) (owner: Filippo Giunchedi)
[14:45:27] !log Starting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[14:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:21] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[14:46:31] (CR) Ladsgroup: [C:+1] maintenance: Update DNS records for maintenance host (switchover #3) [dns] - https://gerrit.wikimedia.org/r/1013064 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[14:46:52] (CR) Alexandros Kosiaris: [C:+1] maintenance: Update DNS records for maintenance host (switchover #3) [dns] - https://gerrit.wikimedia.org/r/1013064 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[14:47:51] (PS2) Effie Mouzeli: maintenance: Update DNS records for maintenance host (switchover #3) [dns] - https://gerrit.wikimedia.org/r/1013064 (https://phabricator.wikimedia.org/T357547)
[14:47:55] (KubernetesRsyslogDown) firing: (23) rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:47:57] (PS3) Effie Mouzeli: maintenance: Update DNS records for maintenance host (switchover #3) [dns] - https://gerrit.wikimedia.org/r/1013064 (https://phabricator.wikimedia.org/T357547)
[14:48:19] (CR) Ladsgroup: maintenance: Update DNS records for maintenance host (switchover #3) [dns] - https://gerrit.wikimedia.org/r/1013064 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[14:49:32] (CR) Effie Mouzeli: [C:+2] maintenance: Update DNS records for maintenance host (switchover #3) [dns] - https://gerrit.wikimedia.org/r/1013064 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[14:50:18] !log jiji@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover - T357547 (duration: 61m 28s)
[14:50:34] Dear deployers, scap is unlocked!
[14:50:36] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547
[14:53:14] (CR) Ahmon Dancy: gitlab: temporary allow dockerfile frontend on Trusted Runners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013049 (https://phabricator.wikimedia.org/T357612) (owner: Jelto)
[14:55:08] (PS3) Brouberol: AQS1.0: disable aqs service [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522)
[14:56:31] ssh deployment.eqiad.wmnet still sends me to deploy2002, is that correct?
[14:56:50] (I would’ve expected deploy1002, but I also remember the deployment host not being changed during some past switchovers or something similar, so maybe it’s all fine ^^)
[14:57:08] Lucas_WMDE: yes, expected
[14:57:10] (PS4) Muehlenhoff: AQS1.0: disable aqs service [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[14:57:17] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:57:21] yeah, plan is to switch deployment hosts tomorrow, deploy2002 is correct atm
[14:57:42] (but thanks for checking!)
[14:57:55] (KubernetesRsyslogDown) resolved: (8) rsyslog on kubernetes2014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:58:00] alright, thanks!
[14:58:11] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1671/co" [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[14:59:06] (PS3) Andrew Bogott: codfw1dev: override hiera lookup hierarchy [puppet] - https://gerrit.wikimedia.org/r/1013056
[14:59:06] (PS9) Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - https://gerrit.wikimedia.org/r/1012765
[14:59:08] (PS1) Andrew Bogott: pcc-db1001.yaml: update a bunch of cloud-vps puppetserver keys [puppet] - https://gerrit.wikimedia.org/r/1013066 (https://phabricator.wikimedia.org/T351450)
[14:59:15] (PS1) Andrew Bogott: puppet7-facts-export-nodb.py: fix a variable name collision [puppet] - https://gerrit.wikimedia.org/r/1013067
[14:59:35] (PS1) Filippo Giunchedi: installserver: update centrallog partman [puppet] - https://gerrit.wikimedia.org/r/1013068 (https://phabricator.wikimedia.org/T359451)
[14:59:44] (CR) Brouberol: [V:+1] AQS1.0: disable aqs service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[14:59:54] !log START lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki viwiki --current --all --touched-after=20230613000000 --start '["17099868"]' 2>&1 | tee ~/T315510-viwiki-4; date # in tmux; note the changed mwmaint host :)
[14:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:12] (CR) Gehel: AQS1.0: disable aqs service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[15:01:29] (CR) Brouberol: [V:+1] AQS1.0: disable aqs service (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[15:02:11] (CR) Andrew Bogott: [C:+2] pcc-db1001.yaml: update a bunch of cloud-vps puppetserver keys [puppet] - https://gerrit.wikimedia.org/r/1013066 (https://phabricator.wikimedia.org/T351450) (owner: Andrew Bogott)
[15:04:05] (CR) Andrew Bogott: [C:+2] codfw1dev: override hiera lookup hierarchy [puppet] - https://gerrit.wikimedia.org/r/1013056 (owner: Andrew Bogott)
[15:04:57] (CR) Muehlenhoff: [C:+2] dumps::generation::server::rsync_firewall: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1009479 (owner: Muehlenhoff)
[15:07:06] (PS1) Ebernhardson: Cirrus: testcommonswiki only needs 1 shard [mediawiki-config] - https://gerrit.wikimedia.org/r/1013069
[15:09:29] (PS1) Effie Mouzeli: geo-maps: make eqiad the default datacentre (switchover #4) [dns] - https://gerrit.wikimedia.org/r/1013070 (https://phabricator.wikimedia.org/T357547)
[15:10:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 16 hosts with reason: Remove circular replication in x1 T358200
[15:10:57] T358200: Database post-switchover tasks March 2024 - https://phabricator.wikimedia.org/T358200
[15:11:04] (PS2) Effie Mouzeli: geo-maps: make eqiad the default datacentre (switchover #4) [dns] - https://gerrit.wikimedia.org/r/1013070 (https://phabricator.wikimedia.org/T357547)
[15:11:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 16 hosts with reason: Remove circular replication in x1 T358200
[15:13:14] (PS1) Muehlenhoff: prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013074
[15:17:02] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[15:17:43] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[15:17:44] !log dani@deploy2002 helmfile
[eqiad] START helmfile.d/services/miscweb: apply
[15:18:19] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[15:18:20] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[15:18:41] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[15:19:03] (CR) Ssingh: [C:+1] geo-maps: make eqiad the default datacentre (switchover #4) [dns] - https://gerrit.wikimedia.org/r/1013070 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[15:19:17] (PS1) Muehlenhoff: mediabackup::storage: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013076
[15:20:21] (CR) Ssingh: [C:+1] benthos/haproxy: fix hiera lookup [puppet] - https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[15:21:34] (CR) Effie Mouzeli: [C:+2] geo-maps: make eqiad the default datacentre (switchover #4) [dns] - https://gerrit.wikimedia.org/r/1013070 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[15:23:12] (CR) BBlack: [C:+1] "Seems logically sound, although I can't claim to have reviewed it or understood the base class it uses even 😊" [cookbooks] - https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[15:24:27] (CR) Fabfur: [V:+1 C:+2] benthos/haproxy: fix hiera lookup [puppet] - https://gerrit.wikimedia.org/r/1013008 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[15:25:06] (PS2) Andrew Bogott: puppet7-facts-export-nodb.py: fix a variable name collision [puppet] - https://gerrit.wikimedia.org/r/1013067
[15:25:06] (PS10) Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - https://gerrit.wikimedia.org/r/1012765
[15:25:08] (PS1) Andrew Bogott: codfw1dev puppet: change the 'puppet' alias to point to the new puppet7 server [puppet] - https://gerrit.wikimedia.org/r/1013079
(https://phabricator.wikimedia.org/T351450)
[15:25:24] (CR) Fabfur: [C:+2] benthos: moved batching as close to the input as possible [puppet] - https://gerrit.wikimedia.org/r/1013040 (https://phabricator.wikimedia.org/T360454) (owner: Fabfur)
[15:25:56] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[15:28:22] (PS1) Effie Mouzeli: debug.json: List primary DC servers first (switchover #5) [mediawiki-config] - https://gerrit.wikimedia.org/r/1013083 (https://phabricator.wikimedia.org/T357547)
[15:28:28] !lof installing squid security updates
[15:28:32] !log installing squid security updates
[15:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:00] hashar: o/ (if you have time) - I got some "no space left on device" when building docker images :( https://integration.wikimedia.org/ci/job/inference-services-pipeline-llm/234/console
[15:30:05] !log repooling cp4037 for a little longer than last time (T358109)
[15:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:15] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109
[15:30:31] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[15:31:09] (CR) Ladsgroup: [C:+1] debug.json: List primary DC servers first (switchover #5) [mediawiki-config] - https://gerrit.wikimedia.org/r/1013083 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[15:33:05] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9646108 (bking) Unfortunately, we are plus the likelihood that there wi...
[15:36:05] (PS1) KartikMistry: Enable ContentTranslation by default for myvwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1013084 (https://phabricator.wikimedia.org/T353510)
[15:36:29] (CR) Muehlenhoff: AQS1.0: disable aqs service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: Brouberol)
[15:38:25] (PS3) Elukey: profile::prometheus::k8s: move istio metrics to a separate job [puppet] - https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390)
[15:40:12] (CR) Tjones: [C:+1] "I only have +1 here..." [mediawiki-config] - https://gerrit.wikimedia.org/r/1013069 (owner: Ebernhardson)
[15:45:00] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[15:45:47] (CR) Filippo Giunchedi: [C:+1] prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013074 (owner: Muehlenhoff)
[15:48:33] !log installing usbutils bugfix updates from Bookworm point release
[15:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: Remove circular replication in es5 T358200
[15:49:20] T358200: Database post-switchover tasks March 2024 - https://phabricator.wikimedia.org/T358200
[15:49:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: Remove circular replication in es5 T358200
[15:50:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: Remove circular replication in es4 T358200
[15:51:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: Remove circular replication in es4 T358200
[15:53:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for
0:30:00 on 27 hosts with reason: Remove circular replication in s6 T358200
[15:53:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 27 hosts with reason: Remove circular replication in s6 T358200
[15:53:49] SRE, Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9646187 (MoritzMuehlenhoff)
[15:53:54] jouncebot: now
[15:53:54] No deployments scheduled for the next 1 hour(s) and 6 minute(s)
[15:53:58] !log installing usb.ids bugfix updates from Bookworm point release
[15:53:58] jouncebot: next
[15:53:58] In 1 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T1700)
[15:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 29 hosts with reason: Remove circular replication in s2 T358200
[15:54:32] T358200: Database post-switchover tasks March 2024 - https://phabricator.wikimedia.org/T358200
[15:54:47] (PS1) Majavah: P:toolforge::k8s: haproxy: Do not start keepalived too early [puppet] - https://gerrit.wikimedia.org/r/1013090 (https://phabricator.wikimedia.org/T349206)
[15:54:50] (CR) Andrew Bogott: [C:+2] codfw1dev puppet: change the 'puppet' alias to point to the new puppet7 server [puppet] - https://gerrit.wikimedia.org/r/1013079 (https://phabricator.wikimedia.org/T351450) (owner: Andrew Bogott)
[15:54:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 29 hosts with reason: Remove circular replication in s2 T358200
[15:55:13] (CR) Andrew Bogott: [C:+2] puppet7-facts-export-nodb.py: fix a variable name collision [puppet] - https://gerrit.wikimedia.org/r/1013067 (owner: Andrew Bogott)
[15:55:41] (ConfdResourceFailed) firing: (14) confd resource _var_lib_gdnsd_discovery-swift-rw.state.toml
has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[15:56:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 26 hosts with reason: Remove circular replication in s3 T358200
[15:56:11] oh hmm
[15:56:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 26 hosts with reason: Remove circular replication in s3 T358200
[15:56:53] (CR) TrainBranchBot: [C:+2] "Approved by jiji@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1013083 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[15:57:29] sukhe: anything I can help ?
[15:57:36] effie: looking
[15:57:47] (Merged) jenkins-bot: debug.json: List primary DC servers first (switchover #5) [mediawiki-config] - https://gerrit.wikimedia.org/r/1013083 (https://phabricator.wikimedia.org/T357547) (owner: Effie Mouzeli)
[15:57:57] Mar 20 14:17:03 dns1004 confd[1823439]: 2024-03-20T14:17:03Z dns1004 /usr/bin/confd[1823439]: ERROR "updating error mtime on /var/run/confd-template/.discovery-parsoid-php.state1198733073.err\nfailed linting '/usr/local/bin/authdns-check-active-passive /var/lib/gdnsd/.discovery-parsoid-php.state1198733073' with 1 (0.>
[15:58:14] !log jiji@deploy2002 Started scap: Backport for [[gerrit:1013083|debug.json: List primary DC servers first (switchover #5) (T357547)]]
[15:58:17] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547
[15:58:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 31 hosts with reason: Remove circular replication in s7 T358200
[15:59:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 31 hosts with reason: Remove circular replication in s7 T358200
[16:00:37] (CR) Majavah: [V:+1] "PCC
SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1672/console" [puppet] - https://gerrit.wikimedia.org/r/1013090 (https://phabricator.wikimedia.org/T349206) (owner: Majavah)
[16:01:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on 27 hosts with reason: Remove circular replication in s5 T358200
[16:01:30] T358200: Database post-switchover tasks March 2024 - https://phabricator.wikimedia.org/T358200
[16:01:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on 27 hosts with reason: Remove circular replication in s5 T358200
[16:01:56] (CR) Jcrespo: [C:+1] "Looks safe, let me merge it tomorrow so not to do it at the end of my day!" [puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[16:02:47] SRE, Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9646233 (MoritzMuehlenhoff)
[16:03:01] (CR) Jcrespo: [C:+1] "maybe just s/revolves/resolves/ on the patch description ?"
[puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[16:03:34] !log jiji@deploy2002 jiji: Backport for [[gerrit:1013083|debug.json: List primary DC servers first (switchover #5) (T357547)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:03:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on 34 hosts with reason: Remove circular replication in s8 T358200
[16:03:40] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547
[16:03:54] (PS2) Muehlenhoff: mediabackup::storage: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1013076
[16:04:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on 34 hosts with reason: Remove circular replication in s8 T358200
[16:05:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on 36 hosts with reason: Remove circular replication in s4 T358200
[16:05:03] (CR) Muehlenhoff: "Ack, fixed." [puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[16:05:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on 36 hosts with reason: Remove circular replication in s4 T358200
[16:06:56] (CR) Jcrespo: "Will do, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/1013076 (owner: Muehlenhoff)
[16:07:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on 37 hosts with reason: Remove circular replication in s1 T358200
[16:07:15] SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9646275 (MoritzMuehlenhoff)
[16:07:21] T358200: Database post-switchover tasks March 2024 - https://phabricator.wikimedia.org/T358200
[16:07:31] (PS1) Majavah: P:toolforge::checker: do not hardcode list of etcd nodes [puppet] - https://gerrit.wikimedia.org/r/1013095 (https://phabricator.wikimedia.org/T279078)
[16:07:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on 37 hosts with reason: Remove circular replication in s1 T358200
[16:08:07] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1673/co" [puppet] - https://gerrit.wikimedia.org/r/1013095 (https://phabricator.wikimedia.org/T279078) (owner: Majavah)
[16:10:13] !log jiji@deploy2002 jiji: Continuing with sync
[16:12:50] SRE, SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9646327 (andrea.denisse) a:andrea.denisse
[16:17:02] !log dbmaint deploy schema change s8 codfw T356166
[16:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:05] !log dbmaint deploy schema change s6 codfw T356166
[16:17:05] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[16:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 27 hosts with reason: Schema change
[16:18:15] !log marostegui@cumin1002 END
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 27 hosts with reason: Schema change
[16:20:10] (PS1) Aklapper: GerritBot: Avoid Phabricator auto-linking Gerrit change numbers [puppet] - https://gerrit.wikimedia.org/r/1013097
[16:20:36] (CR) Aklapper: "Followup attempt in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013097" [puppet] - https://gerrit.wikimedia.org/r/1008001 (https://phabricator.wikimedia.org/T358940) (owner: Mainframe98)
[16:20:41] (ConfdResourceFailed) firing: (14) confd resource _var_lib_gdnsd_discovery-swift-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:21:50] (PS2) Aklapper: GerritBot: Avoid Phabricator auto-linking Gerrit change numbers [puppet] - https://gerrit.wikimedia.org/r/1013097 (https://phabricator.wikimedia.org/T358940)
[16:21:54] (PS1) Urbanecm: Revert "NewcomerTaskStore: update the task queue before finishing loading" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.23) - https://gerrit.wikimedia.org/r/1012779 (https://phabricator.wikimedia.org/T360469)
[16:22:41] !log jiji@deploy2002 Finished scap: Backport for [[gerrit:1013083|debug.json: List primary DC servers first (switchover #5) (T357547)]] (duration: 24m 27s)
[16:22:45] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547
[16:27:25] (SystemdUnitFailed) firing: wmf_auto_restart_nginx.service on apt2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:30:41] (ConfdResourceFailed) firing: (14) confd resource _var_lib_gdnsd_discovery-swift-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd -
https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:30:59] !log rolling restart of confd on A:dnsbox to resolve state state files issue
[16:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:31] (CR) Clément Goubert: Add new ceph container image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[16:35:41] (ConfdResourceFailed) resolved: (14) confd resource _var_lib_gdnsd_discovery-swift-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:36:14] cool :)
[16:39:31] (CR) TrainBranchBot: [C:+2] "Approved by dancy@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.22) - https://gerrit.wikimedia.org/r/1012766 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[16:40:58] (PS1) Gerrit maintenance bot: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1012659 (https://phabricator.wikimedia.org/T360546)
[16:41:07] (PS1) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1012660 (https://phabricator.wikimedia.org/T360546)
[16:41:28] (Abandoned) Marostegui: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1012660 (https://phabricator.wikimedia.org/T360546) (owner: Gerrit maintenance bot)
[16:41:34] (Abandoned) Marostegui: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1012659 (https://phabricator.wikimedia.org/T360546) (owner: Gerrit maintenance bot)
[16:42:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[16:43:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on
db2127.codfw.wmnet with reason: Maintenance [16:44:14] (03CR) 10Elukey: Add new ceph container image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:47:00] (03PS11) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [16:47:00] (03PS1) 10Andrew Bogott: pcc-db1001.yaml: further attempt to upload cloud-vps puppetserver key [puppet] - 10https://gerrit.wikimedia.org/r/1013100 (https://phabricator.wikimedia.org/T351450) [16:48:27] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2107.codfw.wmnet with OS bullseye [16:49:45] (03CR) 10Andrew Bogott: [C:03+2] pcc-db1001.yaml: further attempt to upload cloud-vps puppetserver key [puppet] - 10https://gerrit.wikimedia.org/r/1013100 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [16:49:58] (03CR) 10JMeybohm: [C:03+1] deployment_server: Label and annotation improvements for mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1012803 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [16:51:19] (03CR) 10JMeybohm: [C:03+1] mediawiki: Add a comment annotation for mwscript jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012802 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [16:55:08] (03PS1) 10Cparle: Removing MachineVision events, extension is being sunsetted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013101 (https://phabricator.wikimedia.org/T347970) [16:56:54] (03PS1) 10Marostegui: es2025: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1013103 (https://phabricator.wikimedia.org/T358746) [16:57:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2025 T358746', diff saved to https://phabricator.wikimedia.org/P58826 and previous config saved to /var/cache/conftool/dbconfig/20240320-165710-root.json [16:57:14] 
T358746: Upgrade es5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358746 [16:57:17] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1013068 (https://phabricator.wikimedia.org/T359451) (owner: 10Filippo Giunchedi) [16:58:27] (03CR) 10Marostegui: [C:03+2] es2025: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1013103 (https://phabricator.wikimedia.org/T358746) (owner: 10Marostegui) [16:58:37] (03CR) 10Brouberol: [V:03+1] AQS1.0: disable aqs service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013063 (https://phabricator.wikimedia.org/T360522) (owner: 10Brouberol) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T1700) [17:00:42] (03Merged) 10jenkins-bot: mime: Register `.owl` as application/rdf+xml [core] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1012766 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [17:01:07] !log dancy@deploy2002 Started scap: Backport for [[gerrit:1012766|mime: Register `.owl` as application/rdf+xml (T171807 T359643)]] [17:01:17] T171807: Create ontology URL for mediawiki - https://phabricator.wikimedia.org/T171807 [17:01:17] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [17:03:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool es2025 T358746', diff saved to https://phabricator.wikimedia.org/P58827 and previous config saved to /var/cache/conftool/dbconfig/20240320-170332-root.json [17:03:37] T358746: Upgrade es5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358746 [17:03:41] !log dancy@deploy2002 dancy: Backport for [[gerrit:1012766|mime: Register `.owl` as application/rdf+xml (T171807 T359643)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:04:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2024 T358746', diff saved to 
https://phabricator.wikimedia.org/P58828 and previous config saved to /var/cache/conftool/dbconfig/20240320-170413-root.json [17:04:33] !log dancy@deploy2002 dancy: Continuing with sync [17:04:41] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2107.codfw.wmnet with reason: host reimage [17:05:34] (03PS1) 10Marostegui: es2024: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1013106 (https://phabricator.wikimedia.org/T358746) [17:07:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2107.codfw.wmnet with reason: host reimage [17:07:35] (03CR) 10Marostegui: [C:03+2] es2024: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1013106 (https://phabricator.wikimedia.org/T358746) (owner: 10Marostegui) [17:10:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2113.codfw.wmnet with reason: Maintenance [17:10:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2113.codfw.wmnet with reason: Maintenance [17:12:29] (03PS1) 10Fabfur: benthos: force sequence key to be casted as INT [puppet] - 10https://gerrit.wikimedia.org/r/1013107 (https://phabricator.wikimedia.org/T358109) [17:13:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool es2024 T358746', diff saved to https://phabricator.wikimedia.org/P58829 and previous config saved to /var/cache/conftool/dbconfig/20240320-171356-root.json [17:14:01] T358746: Upgrade es5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358746 [17:15:33] !log dbmaint deploy schema change s2 codfw T356166 [17:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:37] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [17:16:31] !log dancy@deploy2002 Finished scap: Backport for [[gerrit:1012766|mime: Register `.owl` as application/rdf+xml (T171807 
T359643)]] (duration: 15m 24s) [17:16:36] T171807: Create ontology URL for mediawiki - https://phabricator.wikimedia.org/T171807 [17:16:36] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 [17:18:50] (03PS12) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [17:18:51] (03PS1) 10Andrew Bogott: pcc-db1001.yaml: another round of keys for uploading puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/1013108 (https://phabricator.wikimedia.org/T351450) [17:20:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on 16 hosts with reason: Schema change T355609 [17:20:35] (03CR) 10Andrew Bogott: [C:03+2] pcc-db1001.yaml: another round of keys for uploading puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/1013108 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [17:20:35] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:20:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 16 hosts with reason: Schema change T355609 [17:21:38] !log dbmaint deploy schema change s8 codfw T355609 [17:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on 13 hosts with reason: Schema change T355609 [17:22:48] !log dbmaint deploy schema change s5 codfw T356166 [17:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:52] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [17:23:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 13 hosts with reason: Schema change T355609 [17:23:40] (03CR) 10Volans: "LGTM, few questions and comments inline" [cookbooks] - 
10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:24:21] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2107.codfw.wmnet with OS bullseye [17:24:22] (03CR) 10Dzahn: "before:" [puppet] - 10https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:24:52] (03CR) 10Dzahn: [C:03+2] peopleweb: set envoy::ssl_provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:28:25] (03CR) 10Gmodena: [C:03+1] benthos: force sequence key to be casted as INT [puppet] - 10https://gerrit.wikimedia.org/r/1013107 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [17:31:22] (03CR) 10Fabfur: [C:03+2] benthos: force sequence key to be casted as INT [puppet] - 10https://gerrit.wikimedia.org/r/1013107 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [17:32:07] (ProbeDown) firing: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:32:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:33:21] re: people1004 - that looks like it's me.. 
on it [17:33:28] not the active server [17:34:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on 16 hosts with reason: Schema change T356166 [17:34:08] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [17:34:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 16 hosts with reason: Schema change T356166 [17:35:45] (03CR) 10Dzahn: [C:03+2] "after merge and enabling puppet only on the passive server.. turns out we do need people.wikimedia.org on the cert or monitoring checks st" [puppet] - 10https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:36:40] (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:37:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:39:07] 06SRE, 10Wikimedia-Mailing-lists: Subscribe Elton to Internal mailing list for Meta-Wiki oversighters - https://phabricator.wikimedia.org/T360263#9646713 (10Elton) I've received a response from the list owner and now I'm subscribed. I think this task can now be closed as resolved. I'd like to thank you all for... 
[17:40:42] (03PS1) 10Dzahn: peopleweb: add people.wikimedia.org to SAN of cfssl cert [puppet] - 10https://gerrit.wikimedia.org/r/1013112 (https://phabricator.wikimedia.org/T360413) [17:43:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P58830 and previous config saved to /var/cache/conftool/dbconfig/20240320-174339-root.json [17:44:46] hi dancy: hashar! can we please deploy https://gerrit.wikimedia.org/r/c/1012779 before rolling the train forward to group1? this fixed a more critical bug that appeared as a regression from trying to fix a less-critical bug. [17:45:08] urbanecm: Yes. Do you want to handle it? [17:45:31] dancy: i can – i just see there's the train window scheduled in ~15 minutes, so that's why i pinged you [17:45:41] Gotcha. I'll wait until you're done. [17:45:46] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:45:49] (03CR) 10Urbanecm: [C:03+2] Revert "NewcomerTaskStore: update the task queue before finishing loading" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1012779 (https://phabricator.wikimedia.org/T360469) (owner: 10Urbanecm) [17:45:53] sounds good, thanks [17:45:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:46:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P58831 and previous config saved to /var/cache/conftool/dbconfig/20240320-174614-root.json [17:46:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1012779 (https://phabricator.wikimedia.org/T360469) (owner: 10Urbanecm) [17:47:23] 06SRE, 10Wikimedia-Mailing-lists: Subscribe Elton to Internal mailing list for Meta-Wiki oversighters - 
https://phabricator.wikimedia.org/T360263#9646735 (10Dzahn) >>! In T360263#9646713, @Elton wrote: **Please mark the list as private**, if it isn't already, before closing the task. This is something that t... [17:48:49] (03PS1) 10Marostegui: db2096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1013113 [17:49:07] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013112" [puppet] - 10https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:49:20] (03CR) 10Dzahn: [C:03+2] peopleweb: add people.wikimedia.org to SAN of cfssl cert [puppet] - 10https://gerrit.wikimedia.org/r/1013112 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:50:17] (03PS1) 10Fabfur: haproxy: add parameter for optional log length [puppet] - 10https://gerrit.wikimedia.org/r/1013114 (https://phabricator.wikimedia.org/T358109) [17:51:06] (03CR) 10Marostegui: [C:03+2] db2096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1013113 (owner: 10Marostegui) [17:51:25] (03PS3) 10Clare Ming: ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) (owner: 10Phuedx) [17:53:28] (03CR) 10Clare Ming: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) (owner: 10Phuedx) [17:54:03] (03PS1) 10Marostegui: instances.yaml: Remove db2096 [puppet] - 10https://gerrit.wikimedia.org/r/1013116 (https://phabricator.wikimedia.org/T360554) [17:55:36] (03CR) 10Dzahn: [C:03+2] "[people1004:~] $ sudo openssl x509 -noout -ext subjectAltName -in /etc/envoy/ssl/discovery__peopleweb_discovery_wmnet_server.chained.pem" [puppet] - 10https://gerrit.wikimedia.org/r/1013112 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:55:37] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db2096 [puppet] - 
10https://gerrit.wikimedia.org/r/1013116 (https://phabricator.wikimedia.org/T360554) (owner: 10Marostegui) [17:57:02] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1013114 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [17:57:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2096 from dbctl', diff saved to https://phabricator.wikimedia.org/P58832 and previous config saved to /var/cache/conftool/dbconfig/20240320-175702-marostegui.json [17:57:15] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1013012 (owner: 10Muehlenhoff) [17:57:28] (03CR) 10Fabfur: haproxy: add parameter for optional log length [puppet] - 10https://gerrit.wikimedia.org/r/1013114 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [17:57:33] (03PS2) 10Jdlrobson: Make night theme available on shwiki, exclude additional actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012755 (https://phabricator.wikimedia.org/T359183) [17:58:11] 06SRE, 10Wikimedia-Mailing-lists: Subscribe Elton to Internal mailing list for Meta-Wiki oversighters - https://phabricator.wikimedia.org/T360263#9646786 (10Ladsgroup) >>! In T360263#9646735, @Dzahn wrote: >>>! In T360263#9646713, @Elton wrote: **Please mark the list as private**, if it isn't already, before c... 
[17:58:19] 06SRE, 10Wikimedia-Mailing-lists: 14Subscribe Elton to Internal mailing list for Meta-Wiki oversighters - 14https://phabricator.wikimedia.org/T360263#9646787 (10Ladsgroup) 05Open→03Resolved [17:58:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P58833 and previous config saved to /var/cache/conftool/dbconfig/20240320-175844-root.json [17:58:47] (03PS1) 10Marostegui: mariadb: Remove db2096 [puppet] - 10https://gerrit.wikimedia.org/r/1013117 (https://phabricator.wikimedia.org/T360554) [17:59:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2096.codfw.wmnet [18:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T1800) [18:00:33] (03CR) 10Marostegui: [C:03+2] mariadb: Remove db2096 [puppet] - 10https://gerrit.wikimedia.org/r/1013117 (https://phabricator.wikimedia.org/T360554) (owner: 10Marostegui) [18:00:45] (03CR) 10Ssingh: [C:03+1] "I am assuming 16384 is intended on cp4037 but +1." 
[puppet] - 10https://gerrit.wikimedia.org/r/1013114 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [18:01:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P58834 and previous config saved to /var/cache/conftool/dbconfig/20240320-180120-root.json [18:01:27] (03CR) 10Tacsipacsi: GerritBot: Avoid Phabricator auto-linking Gerrit change numbers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013097 (https://phabricator.wikimedia.org/T358940) (owner: 10Aklapper) [18:03:12] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [18:03:37] (ProbeDown) resolved: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:05:10] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2096.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [18:06:03] (03PS2) 10Fabfur: haproxy: add parameter for optional log length [puppet] - 10https://gerrit.wikimedia.org/r/1013114 (https://phabricator.wikimedia.org/T358109) [18:06:22] (03CR) 10Ssingh: haproxy: add parameter for optional log length [puppet] - 10https://gerrit.wikimedia.org/r/1013114 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [18:06:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2096.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [18:06:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:06:38] !log marostegui@cumin1002 END (PASS) - 
Cookbook sre.hosts.decommission (exit_code=0) for hosts db2096.codfw.wmnet [18:07:27] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9646850 (10Dzahn) [18:07:43] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2096 - https://phabricator.wikimedia.org/T360554#9646847 (10Marostegui) a:05Marostegui→03None [18:08:05] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2096 - https://phabricator.wikimedia.org/T360554#9646852 (10Marostegui) This is ready for DC-Ops [18:11:46] (03PS1) 10Dzahn: planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) [18:12:58] (03CR) 10CI reject: [V:04-1] planet: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013120 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [18:13:18] (03CR) 10Ssingh: "Thanks for the review! 
Comments in-line and updated patch to follow" [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:13:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P58835 and previous config saved to /var/cache/conftool/dbconfig/20240320-181350-root.json [18:13:52] (03PS5) 10Ssingh: cookbooks.sre.dns: add roll-reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) [18:14:20] (03Merged) 10jenkins-bot: Revert "NewcomerTaskStore: update the task queue before finishing loading" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1012779 (https://phabricator.wikimedia.org/T360469) (owner: 10Urbanecm) [18:14:48] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1012779|Revert "NewcomerTaskStore: update the task queue before finishing loading" (T360469 T359992)]] [18:14:58] T360469: [wmf.23 testwiki] Post-edit dialog navigation buttons are disabled - https://phabricator.wikimedia.org/T360469 [18:14:58] T359992: [wmf.21] Homepage - empty filter selection can be saved - https://phabricator.wikimedia.org/T359992 [18:16:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P58836 and previous config saved to /var/cache/conftool/dbconfig/20240320-181626-root.json [18:17:12] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1012779|Revert "NewcomerTaskStore: update the task queue before finishing loading" (T360469 T359992)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:18:22] !log urbanecm@deploy2002 urbanecm: Continuing with sync [18:19:05] (03CR) 10CI reject: [V:04-1] cookbooks.sre.dns: add roll-reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) 
(owner: 10Ssingh) [18:19:27] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2108.codfw.wmnet with OS bullseye [18:21:09] (03PS6) 10Ssingh: cookbooks.sre.dns: add roll-reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1012719 (https://phabricator.wikimedia.org/T347054) [18:26:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:28:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P58837 and previous config saved to /var/cache/conftool/dbconfig/20240320-182855-root.json [18:30:51] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1012779|Revert "NewcomerTaskStore: update the task queue before finishing loading" (T360469 T359992)]] (duration: 16m 02s) [18:30:57] dancy: i'm done [18:30:58] T360469: [wmf.23 testwiki] Post-edit dialog navigation buttons are disabled - https://phabricator.wikimedia.org/T360469 [18:30:59] T359992: [wmf.21] Homepage - empty filter selection can be saved - https://phabricator.wikimedia.org/T359992 [18:31:04] Thanks. Rolling the train now. 
[18:31:23] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013123 (https://phabricator.wikimedia.org/T354441) [18:31:24] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013123 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [18:31:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P58838 and previous config saved to /var/cache/conftool/dbconfig/20240320-183131-root.json [18:31:38] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:31:43] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9646986 (10Papaul) [18:31:45] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:32:49] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013123 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [18:32:50] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9646991 (10Papaul) Removed all old cables and unracked 4 switches out of 8 [18:35:16] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2108.codfw.wmnet with reason: host reimage [18:36:40] (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:37:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2108.codfw.wmnet with reason: host reimage [18:39:15] 
(MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:40:38] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013125 (https://phabricator.wikimedia.org/T354441) [18:40:40] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013125 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [18:40:44] (03PS13) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [18:40:46] (03PS1) 10Andrew Bogott: pcc-db1001.yaml: yet newer keys for uploading puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/1013126 (https://phabricator.wikimedia.org/T351450) [18:40:49] Rolling the train back due to error rate increas [18:40:56] e [18:41:24] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013125 (https://phabricator.wikimedia.org/T354441) (owner: 10TrainBranchBot) [18:41:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:42:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:42:15] (03CR) 10Andrew Bogott: [C:03+2] pcc-db1001.yaml: yet 
newer keys for uploading puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/1013126 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [18:44:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P58839 and previous config saved to /var/cache/conftool/dbconfig/20240320-184400-root.json [18:44:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:46:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P58840 and previous config saved to /var/cache/conftool/dbconfig/20240320-184637-root.json [18:47:42] (03PS3) 10Andrew Bogott: git-sync-upstream.py: run through black [puppet] - 10https://gerrit.wikimedia.org/r/1009799 [18:49:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:50:25] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: 14Access to rua-dmarc@wikimedia.org - 14https://phabricator.wikimedia.org/T360462#9647060 (10Jgreen) [18:50:28] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9647061 (10Jgreen) [18:50:57] (03CR) 10Andrew Bogott: [C:03+2] git-sync-upstream.py: run through black [puppet] - 10https://gerrit.wikimedia.org/r/1009799 (owner: 10Andrew Bogott) [18:51:17] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9647066 (10Jgreen) [18:52:01] (03CR) 10Andrew Bogott: [C:03+1] O:puppetserver: enable openstack stale cert exporter [puppet] - 10https://gerrit.wikimedia.org/r/1013039 (owner: 
10Majavah) [18:52:32] (03CR) 10Andrew Bogott: "Thank you Moritz! The docs I was thinking about are here:" [puppet] - 10https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138) (owner: 10Majavah) [18:54:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2108.codfw.wmnet with OS bullseye [18:56:47] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.23 refs T354441 [18:56:51] T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441 [18:58:52] (03PS1) 10Dzahn: ssl: delete peopleweb cert, replaced by cfssl provided cert [puppet] - 10https://gerrit.wikimedia.org/r/1013128 (https://phabricator.wikimedia.org/T360413) [18:59:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P58841 and previous config saved to /var/cache/conftool/dbconfig/20240320-185906-root.json [18:59:30] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:00:45] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:01:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P58842 and previous config saved to /var/cache/conftool/dbconfig/20240320-190143-root.json [19:02:40] (KubernetesRsyslogDown) firing: rsyslog on mw2449:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2449 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:03:20] (03CR) 10Dzahn: [C:03+2] "also ""sudo rm /etc/ssl/localcerts/peopleweb.discovery.wmnet.c*" on the people* machines, deleting cert files from private repo and anothe" [puppet] - 10https://gerrit.wikimedia.org/r/1012749 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [19:05:25] (03CR) 10Dzahn: [C:03+2] ssl: delete peopleweb cert, replaced by cfssl provided cert [puppet] - 10https://gerrit.wikimedia.org/r/1013128 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [19:05:32] (03PS2) 10Dzahn: ssl: delete peopleweb cert, replaced by cfssl provided cert [puppet] - 10https://gerrit.wikimedia.org/r/1013128 (https://phabricator.wikimedia.org/T360413) [19:07:40] (KubernetesRsyslogDown) resolved: (2) rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:14:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P58843 and previous config saved to /var/cache/conftool/dbconfig/20240320-191412-root.json [19:16:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P58844 and previous config saved to /var/cache/conftool/dbconfig/20240320-191649-root.json [19:18:10] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:21:14] !log stoping Prometheus on all instances to remediate Thanos Sidecar issues. 
[19:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:18] !log stopping the Prometheus service on all Prometheus instances to remediate Thanos Sidecar issues - T354399 [19:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:22] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399 [19:22:55] (03PS1) 10Tchanders: Schedule weekly purge of global_block_whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) [19:26:30] !log Moving the WAL directory to start with a fresh WAL - T354399 [19:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:43] (03CR) 10CI reject: [V:04-1] Schedule weekly purge of global_block_whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: 10Tchanders) [19:26:46] !log Starting the Prometheus service - T354399 [19:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:12] (03PS1) 10Zabe: Add inline background color [extensions/Linter] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1012781 (https://phabricator.wikimedia.org/T359205) [19:29:59] (03PS2) 10Zabe: Add inline background color [extensions/Linter] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1012781 (https://phabricator.wikimedia.org/T359205) [19:33:52] (03CR) 10Majavah: [C:03+2] O:puppetserver: enable openstack stale cert exporter [puppet] - 10https://gerrit.wikimedia.org/r/1013039 (owner: 10Majavah) [19:45:05] !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@49dac10]: (no justification provided) [19:45:13] !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@49dac10]: (no justification provided) (duration: 00m 08s) [19:47:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:47:10] !log @deploy2002 helmfile 
[eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:49:17] !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@49dac10]: (no justification provided) [19:49:23] !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@49dac10]: (no justification provided) (duration: 00m 05s) [19:52:18] hey folks, here to chat about T360565 if that's helpful [19:52:19] T360565: MediaWiki\Linter\MissingCategoryException: Cannot find id for 'night-mode-unaware-background-color' - https://phabricator.wikimedia.org/T360565 [19:54:08] I'm also here for web if there's any way we can be helpful here [19:54:34] I think the next step is to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Linter/+/1012781 to wmf.22 ? [19:54:35] (03CR) 10Arlolra: [C:03+1] Add inline background color [extensions/Linter] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1012781 (https://phabricator.wikimedia.org/T359205) (owner: 10Zabe) [19:54:45] (03CR) 10C. Scott Ananian: [C:03+1] Add inline background color [extensions/Linter] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1012781 (https://phabricator.wikimedia.org/T359205) (owner: 10Zabe) [19:55:07] dancy, hashar: train related ^ [19:55:08] Agreed - I'm not able to do that (sorry), but we can find someone else to if you need to go back to the offsite? [19:55:20] cscott: 19:55:08 Agreed - I'm not able to do that (sorry), but we can find someone else to if you need to go back to the offsite? [19:55:27] Missed while you disconnected [19:56:21] sorry, i'm out of sync with the train and backport schedule.  should we wait until the next backport window, or do we want to get it done now in order to unblock the train? 
[19:56:40] Best to wait for releng [19:56:53] I've pinged dancy [19:56:58] another option is to push through to group 1, ignoring the logspam, but get the patch backported before group2 so that we don't get logspam during tomorrow's train [19:58:24] we're scheduled to go on a hike in an hour on the offsite, so i'm biased toward either getting it all done asap or else taking a step back and waiting until the usual windows. [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T2000). nyaa~ [20:00:05] RoanKattouw, Jdlrobson, and Jdrewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:46] o/ is the window blocked? [20:01:10] cscott: If you have train blocker fix to deploy then you should go ahead IMO [20:01:15] cscott: needs a deployment for a train blocker [20:01:51] I haven't done a train deploy for ~5 years i think, but i'm here for moral support. [20:02:00] RoanKattouw: are you able to help cscott get the backport done? [20:02:15] cscott: if you need a backport, this window is perfect for it [20:02:16] OK if you have a link to a patch I can deploy it [20:02:28] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Linter/+/1012781 [20:02:39] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Linter/+/1012781 yup [20:02:52] backported to wmf.22, which ought to unblock the train rolling forward for wmf.23 [20:03:19] subbu arlo and i can verify that the wmf.22 backport works [20:03:36] (we're all sitting together in the kitchen at the wmf offices, yay offsite) [20:03:53] (Thank you content transform team! Sorry about the disruption to your offsite!!) 
[20:04:08] cscott: I'm also here but I'm in a meeting so I'm in R102 [20:04:24] I can join you in person at 1:30 but in the meantime I'll deploy your patch [20:05:02] it'll probably take that long for CI and scap [20:05:09] thank goodness for irc or how would we manage to communicate [20:05:19] (thank you also Roan!!) [20:06:09] I'm confused, why is a backport to wmf.22 a train blocker? Aren't we trying to go from wmf.22 to wmf.23? [20:06:31] I'll +2 it to get the process going, but I'd like to understand that [20:06:34] (03CR) 10Catrope: [C:03+2] Add inline background color [extensions/Linter] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1012781 (https://phabricator.wikimedia.org/T359205) (owner: 10Zabe) [20:06:36] short story, parsoid in wmf.23 is reporting a new lint error which the Linter extension in wmf.22 doesn't know about [20:06:57] Aaah and this caused rollback issues due to stored data [20:06:57] ? [20:06:59] simplest solution is to teach linter in wmf.22 about it, which avoids transient logspam during the deploy [20:07:03] Gotcha [20:07:33] mostly logspam caused by newer wmf.23 parsoids talking to not-yet-deployed wmf.22 servers, and sre was worried that the logspam might obscure any *actual* train errors. [20:07:52] 10ops-eqiad, 10decommission-hardware: decommission wdqs100[6-8] - https://phabricator.wikimedia.org/T353845#9647266 (10RKemper) a:03Jclark-ctr [20:07:59] in theory we could just push through the train and the situation ought to magically resolve itself, but this way is a bit safer [20:08:17] 10ops-eqiad, 10decommission-hardware: decommission wdqs100[6-8] - https://phabricator.wikimedia.org/T353845#9647280 (10RKemper) Had forgotten to properly assign dc-ops as well as tag for the DC. Straightened that out now, so this should be ready for dc-ops to do the decom. 
[20:09:26] (03Merged) 10jenkins-bot: Add inline background color [extensions/Linter] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1012781 (https://phabricator.wikimedia.org/T359205) (owner: 10Zabe) [20:09:40] (03CR) 10Catrope: [C:03+2] htmlform: Fix double escaping in Label div [core] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1012768 (https://phabricator.wikimedia.org/T360381) (owner: 10Catrope) [20:12:56] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:13:03] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:15:39] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:15:46] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:19:22] RoanKattouw let us know when scap is done i guess? [20:19:34] Sorry only now starting it, I didn't get pinged when the patch merged [20:20:01] !log catrope@deploy2002 Started scap: Backport for [[gerrit:1012781|Add inline background color (T359205 T360565)]] [20:20:03] Beware: Deployments are taking about 15 minutes these days. 
[20:20:10] T359205: Deployment Ticket - Linting Rule for Template Style Fixes - https://phabricator.wikimedia.org/T359205 [20:20:11] T360565: MediaWiki\Linter\MissingCategoryException: Cannot find id for 'night-mode-unaware-background-color' - https://phabricator.wikimedia.org/T360565 [20:22:54] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:23:01] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:24:29] (03PS1) 10CDobbins: admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1013139 [20:24:39] (03CR) 10Mainframe98: [C:03+1] GerritBot: Avoid Phabricator auto-linking Gerrit change numbers [puppet] - 10https://gerrit.wikimedia.org/r/1013097 (https://phabricator.wikimedia.org/T358940) (owner: 10Aklapper) [20:26:24] (03Abandoned) 10CDobbins: admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1005122 (owner: 10CDobbins) [20:26:30] (03Merged) 10jenkins-bot: htmlform: Fix double escaping in Label div [core] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1012768 (https://phabricator.wikimedia.org/T360381) (owner: 10Catrope) [20:27:26] (SystemdUnitFailed) firing: wmf_auto_restart_nginx.service on apt2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:28:38] (03CR) 10Ssingh: [C:03+1] admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1013139 (owner: 10CDobbins) [20:30:51] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway.discovery.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:32:47] o/ [20:33:28] ^ quite possibly continuation from earlier (and not really actionable)? but I don't actually know [20:33:59] kamila_: thanks, from earlier, you mean the switchover, or something else? [20:34:37] jhathaway: there's an issue with restbase, not related to switchover, just stuff not getting processed [20:34:54] ah, thanks [20:35:00] dancy: I'm noticing that now. The step to build k8s images just took 11 minutes for me, and pulling those images is taking over 4 minutes already [20:35:20] RoanKattouw: any problems with pushing my config patch too? [20:35:40] Jdlrobson: I'll do yours once I get to it (after Scott's and then mine), it's just that every patch takes forever now [20:35:47] When dancy said 15 minutes he was lowballing it [20:35:51] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:35:52] Roan: Hmm.. that sounds like a different issue. Were any l10n files touched ? [20:36:01] yes alas [20:36:03] Yep no problem just didn't want to assume. I have time for getting a cup of tea which was the main reason for my question :) [20:36:10] dancy: They were, yes, but the step where I would have expected that to show up was fast [20:36:26] Can you send me the transcript? [20:36:38] https://www.irccloud.com/pastebin/7kGFwY4D/ [20:36:45] That's the l10n part [20:36:58] nod. 
505 langs rebuilt == gigabytes [20:37:00] Here's the slow part: [20:37:05] https://www.irccloud.com/pastebin/g4bDVm25/ [20:37:21] Oooh, I see, it says "0 out of 505" the second time but "505 out of 505" the first time, I missed that [20:37:43] I guess it took only 43 seconds to build the JSON data but much longer to build+fetch the k8s images [20:37:44] jhathaway: if it is indeed the earlier problem, then the summary for that one is that ideally we'd ping content transform because it's a software problem, but they're at an offsite, so we may try to look at it some more tomorrow [20:38:02] kamila_: nod, thanks [20:41:36] kamila_: content transformation are the guys deploying right now aren't they [20:41:39] cscott: ^ [20:41:47] yup [20:41:51] oh, yes, hi! [20:41:56] autobots assemble [20:43:18] daniel and yiannis are the restbase folks though and they are already off on their afternoon activities i believe [20:43:55] yeah, I don't really want to bother you folks when you're on an offsite [20:44:29] Oh yeah here we go, the data is just huge: [20:44:29] 20:42:00 rsync transfer: average 2,998,347,537 bytes/host, total 11,993,390,148 bytes [20:44:54] I was quite the opposite of bored on our offsite last week even without having to fix stuff :D [20:45:34] !log catrope@deploy2002 catrope and zabe: Backport for [[gerrit:1012781|Add inline background color (T359205 T360565)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:45:42] T359205: Deployment Ticket - Linting Rule for Template Style Fixes - https://phabricator.wikimedia.org/T359205 [20:45:42] T360565: MediaWiki\Linter\MissingCategoryException: Cannot find id for 'night-mode-unaware-background-color' - https://phabricator.wikimedia.org/T360565 [20:47:38] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:47:45] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:47:50] !log catrope@deploy2002 catrope and 
zabe: Continuing with sync [20:47:53] we're good to go [20:48:26] it looks good on the canaries on enwiki (it's also already running on group0 already so this is not surprising) [20:49:59] Woohoo! [20:50:23] Thanks for your help everyone! [20:50:26] I'll roll forward now [20:50:50] actually.. checking something. [20:51:14] No backport needed for .23? [20:51:33] cscott: ^ [20:53:10] (KubernetesRsyslogDown) resolved: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240320T2100) [21:01:09] RoanKattouw: No .23 backport needed? [21:01:10] dancy: My deploy is still not finished so hold on [21:01:18] No there's no .23 backport needed [21:01:23] OK. thanks [21:01:33] I was also confused but Scott explained it to me [21:02:03] The .22 patch backfills something so that stuff generated by .23 doesn't confuse the code in .22 during deploy transientness or if we roll back [21:02:17] Gotcha. [21:02:59] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:1012781|Add inline background color (T359205 T360565)]] (duration: 42m 57s) [21:03:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:03:04] T359205: Deployment Ticket - Linting Rule for Template Style Fixes - https://phabricator.wikimedia.org/T359205 [21:03:04] T360565: MediaWiki\Linter\MissingCategoryException: Cannot find id for 'night-mode-unaware-background-color' - https://phabricator.wikimedia.org/T360565 [21:03:06] And then when this one finishes I have two more patches :( [21:03:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:03:54] Ok. 
Ping me when it's time for me to take over [21:04:40] !log catrope@deploy2002 Started scap: Backport for [[gerrit:1012768|htmlform: Fix double escaping in Label div (T360381)]] [21:04:43] T360381: Secure login link on Special:UserLogin gets double-escaped - https://phabricator.wikimedia.org/T360381 [21:04:51] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:05:40] (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:06:25] \o [21:06:32] oh, actually, I can't read, this doesn't necessarily look related to the restbase problem from earlier [21:07:07] <_joe_> which restbase problem? [21:07:07] !log catrope@deploy2002 catrope: Backport for [[gerrit:1012768|htmlform: Fix double escaping in Label div (T360381)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:16] ok, thanks kamila_ [21:07:20] <_joe_> this seems like a slightly higher than usual number of timeouts [21:07:57] <_joe_> you can try to look at the rest gateway telemetry to find out what's timing out [21:08:01] eh what is wrong with my brain, s/restbase/changeprop [21:08:01] <_joe_> given it's 504s [21:08:17] * kamila_ will stop adding noise now [21:08:20] <_joe_> kamila_: ah ok, because I think there is potentially a connection [21:08:30] oh, okay! I am smart and totally see it! 
[21:08:31] :D [21:08:39] well, changeprop was restbase, earlier [21:08:48] !log catrope@deploy2002 catrope: Continuing with sync [21:08:51] it was a restbase error bubbling up through changeprop [21:09:36] <_joe_> kamila_: uh wait, changeprop died again and needed to be restarted? [21:09:51] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:10:25] _joe_: I don't think so, I think we left it alone after the one restart [21:10:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:11:28] "look at the rest gateway telemetry", besides the grafana dashboard, any other places to look _joe_? 
[21:11:30] The earlier changeprop/restbase error I'm aware of from earlier (in case I missed something else), was this: https://phabricator.wikimedia.org/T360548 [21:11:46] <_joe_> jhathaway: no just the grafana dashboard [21:11:51] this is something else (something other than that) [21:11:54] nod, thanks [21:13:57] <_joe_> jhathaway: i think this is the answer [21:13:59] <_joe_> https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&var-datasource=eqiad%20prometheus%2Fk8s&var-instance=rest-gateway&from=now-24h&to=now&viewPanel=12 [21:14:00] this is something bubbling up errors through restbase, whatever it uses for feeds [21:14:12] <_joe_> urandom: yes [21:14:28] <_joe_> so wikifeeds has a slightly higher than usual rate of timeouts [21:14:48] <_joe_> this might fire again [21:14:55] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:15:12] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:15:22] * jhathaway looks for wikifeeds telemetry [21:15:27] <_joe_> zooming out, there is an obvious issue [21:15:29] <_joe_> https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&var-datasource=eqiad%20prometheus%2Fk8s&var-instance=rest-gateway&from=now-30d&to=now&viewPanel=12 [21:16:13] <_joe_> started on the 10th [21:16:28] <_joe_> and only in eqiad [21:16:39] <_joe_> which I think circles back to urandom's task [21:17:22] I don't know how wikifeeds works, but I think this is different [21:17:55] <_joe_> ah actually no, i think wikifeeds could use more replicas [21:18:10] the errors associated with T360548 are not currently happening [21:18:10] T360548: Cassandra quorum read timeouts during node decommissions - https://phabricator.wikimedia.org/T360548 [21:18:13] replicas? [21:18:23] <_joe_> kubernetes pods [21:18:34] oh I see, so it's a capacity issue? 
[21:18:53] <_joe_> I think so, because the request rate is a bit higher in eqiad [21:19:05] <_joe_> but I won't do anything about it at 10 pm :) [21:19:07] I can fix that [21:19:20] thanks for the pointer _joe_ , now go away :D [21:19:28] <_joe_> kamila_: let's do it tomorrow morning, I suspect the problem is deeper maybe [21:19:32] (and also, that graph is terrifying XD) [21:19:33] fair [21:20:27] I don't see much of a bump in requests on march 10th, but that is when latency increased [21:20:28] that graph is terrifying [21:20:34] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:1012768|htmlform: Fix double escaping in Label div (T360381)]] (duration: 15m 54s) [21:20:40] T360381: Secure login link on Special:UserLogin gets double-escaped - https://phabricator.wikimedia.org/T360381 [21:21:00] https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&var-datasource=eqiad%20prometheus%2Fk8s&var-instance=rest-gateway&from=now-30d&to=now&viewPanel=12 I mean [21:21:09] what changed on the 10th to cause that? 
[21:21:29] (03PS3) 10Catrope: Make night theme available on shwiki, exclude additional actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012755 (https://phabricator.wikimedia.org/T359183) (owner: 10Jdlrobson) [21:21:31] (03CR) 10Catrope: [C:03+2] Make night theme available on shwiki, exclude additional actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012755 (https://phabricator.wikimedia.org/T359183) (owner: 10Jdlrobson) [21:21:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012755 (https://phabricator.wikimedia.org/T359183) (owner: 10Jdlrobson) [21:22:13] (03Merged) 10jenkins-bot: Make night theme available on shwiki, exclude additional actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012755 (https://phabricator.wikimedia.org/T359183) (owner: 10Jdlrobson) [21:22:21] anyone know how to see what commit or version of wikifeeds is deployed [21:22:23] <_joe_> https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&var-dc=thanos&var-site=eqiad&var-service=wikifeeds&var-prometheus=k8s&var-container_name=All&refresh=30s&from=now-30d&to=now&viewPanel=20 makes me think it's aqs? 
[21:22:37] !log catrope@deploy2002 Started scap: Backport for [[gerrit:1012755|Make night theme available on shwiki, exclude additional actions (T359183 T359152)]] [21:22:42] T359183: Exclude non-functional pages from night mode - https://phabricator.wikimedia.org/T359183 [21:22:42] T359152: Deploy initial version of night mode to pilot wikis on the mobile website for testing - https://phabricator.wikimedia.org/T359152 [21:22:50] they upgraded to node18, recently, https://phabricator.wikimedia.org/T358017, though I don't know if it is deployed [21:22:51] <_joe_> jhathaway: you can see which version of the image is [21:22:57] nod [21:23:06] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:23:13] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:24:17] does wikifeeds scrape data from aqs? [21:25:02] <_joe_> pods have been running for 42 days [21:25:04] !log catrope@deploy2002 catrope and jdlrobson: Backport for [[gerrit:1012755|Make night theme available on shwiki, exclude additional actions (T359183 T359152)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:25:07] <_joe_> so much longer [21:25:45] Jdlrobson: Your change is finally on the test servers, please test [21:26:40] (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:27:06] RoanKattouw: I can test the change [21:27:42] RoanKattouw: please sync! 
[21:28:07] !log catrope@deploy2002 catrope and jdlrobson: Continuing with sync [21:28:09] yup looks good [21:31:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:38:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:38:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:38:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 28.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:39:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 888.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:39:48] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:1012755|Make night theme available on shwiki, exclude additional actions (T359183 T359152)]] (duration: 17m 10s) [21:39:52] T359183: Exclude non-functional pages from night mode - https://phabricator.wikimedia.org/T359183 [21:39:53] T359152: Deploy initial version of night mode to pilot wikis on the mobile website for testing - https://phabricator.wikimedia.org/T359152 [21:40:14] dancy: Alright I'm finally done, sorry it took so long [21:40:28] No problem. 
[21:40:41] (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013144 (https://phabricator.wikimedia.org/T354441)
[21:40:43] (CR) TrainBranchBot: [C:+2] group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013144 (https://phabricator.wikimedia.org/T354441) (owner: TrainBranchBot)
[21:41:24] (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1013144 (https://phabricator.wikimedia.org/T354441) (owner: TrainBranchBot)
[21:43:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:44:40] (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:45:26] (PS1) Dzahn: requesttracker: switch SSL cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013145 (https://phabricator.wikimedia.org/T360413)
[21:45:28] (PS1) Dzahn: etherpad: switch SSL cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013146 (https://phabricator.wikimedia.org/T360413)
[21:45:29] (PS1) Dzahn: releases: switch SSL cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013147 (https://phabricator.wikimedia.org/T360413)
[21:45:41] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync
[21:46:01] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync
[21:49:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:54:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 824.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:55:47] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.23 refs T354441
[21:55:51] T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441
[22:00:51] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:01:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 823.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:05:51] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from rest-gateway.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:05:59] (PS1) Ahmon Dancy: mediawiki.yaml: Use static.php to serve www.mediawiki.org/ontology/ontology.owl [puppet] - https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807)
[22:06:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 823.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:07:13] (CR) CI reject: [V:-1] mediawiki.yaml: Use static.php to serve www.mediawiki.org/ontology/ontology.owl [puppet] - https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[22:08:54] !log dancy@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.23 refs T354441 (duration: 13m 06s)
[22:08:58] T354441: 1.42.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T354441
[22:10:39] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1013128 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[22:13:40] (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:14:41] (PS2) Ahmon Dancy: mediawiki.yaml: Use static.php to serve www.mediawiki.org/ontology/ontology.owl [puppet] - https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807)
[22:18:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:28:40] (KubernetesRsyslogDown) firing: rsyslog on mw2406:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2406 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:42:14] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:57:48] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:57:57] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:10:17] SRE, SRE-swift-storage, Thumbor, Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9647880 (Ladsgroup) I decided to take a look at numbers of top hitters up to the point it would fill ATS (I went to top 5m objects until it borked). Unfortun...
[23:33:17] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:33:23] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:37:33] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:37:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:41:09] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:41:16] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:47:15] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:47:22] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply