[00:18:04] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:06] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/937593 [00:38:46] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/937593 (owner: 10TrainBranchBot) [00:54:00] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/937593 (owner: 10TrainBranchBot) [02:03:36] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:08:22] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:22] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:11:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:12:04] PROBLEM - Juniper alarms on cr2-eqiad is 
CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [04:12:52] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:13:24] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [04:29:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:41:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:47:05] (03PS7) 10Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) [05:48:45] (03CR) 10CI reject: [V: 04-1] confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [05:53:54] (03PS1) 10Giuseppe Lavagetto: Fix max upload size in php-fpm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/938027 (https://phabricator.wikimedia.org/T341825) [05:57:22] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Fix max upload size in php-fpm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/938027 (https://phabricator.wikimedia.org/T341825) (owner: 10Giuseppe Lavagetto) [05:57:32] (03PS2) 10Giuseppe Lavagetto: Fix max upload size in php-fpm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/938027 
(https://phabricator.wikimedia.org/T341825) [05:57:37] (03CR) 10Giuseppe Lavagetto: [V: 03+2] Fix max upload size in php-fpm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/938027 (https://phabricator.wikimedia.org/T341825) (owner: 10Giuseppe Lavagetto) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230714T0600) [06:16:44] !log oblivian@deploy1002 Started scap: (no justification provided) [06:23:10] (03PS8) 10Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) [06:24:01] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [06:25:20] (03CR) 10CI reject: [V: 04-1] confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [06:26:11] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [06:26:20] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:26:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:28:22] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:43:05] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42467/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [06:44:50] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10SLyngshede-WMF) We do have sort of a work-around, which is currently for review. We let the IDM call the createUser api on mediawiki, so tha... [06:58:53] (03PS1) 10Giuseppe Lavagetto: mw-debug: add ini values dumper [deployment-charts] - 10https://gerrit.wikimedia.org/r/938030 (https://phabricator.wikimedia.org/T341825) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230714T0700) [07:01:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-debug: add ini values dumper [deployment-charts] - 10https://gerrit.wikimedia.org/r/938030 (https://phabricator.wikimedia.org/T341825) (owner: 10Giuseppe Lavagetto) [07:02:40] (03Merged) 10jenkins-bot: mw-debug: add ini values dumper [deployment-charts] - 10https://gerrit.wikimedia.org/r/938030 (https://phabricator.wikimedia.org/T341825) (owner: 10Giuseppe Lavagetto) [07:04:47] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [07:04:57] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [07:06:39] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [07:06:42] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [07:06:51] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [07:07:22] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [07:12:38] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [07:13:31] !log hashar@deploy1002 Started deploy [integration/docroot@56b5745]: Add mwbot-rs to doc.wikimedia.org - 
T341543 [07:13:34] T341543: Publish mwbot-rs docs on doc.wikimedia.org - https://phabricator.wikimedia.org/T341543 [07:13:39] !log hashar@deploy1002 Finished deploy [integration/docroot@56b5745]: Add mwbot-rs to doc.wikimedia.org - T341543 (duration: 00m 08s) [07:16:30] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [07:19:37] (03PS9) 10Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) [07:22:18] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:24:54] (03CR) 10Vgutierrez: hiera: apply silent-drop on port 80 to all eqsin cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [07:26:45] (03PS6) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [07:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:29:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:32:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:37:40] (03PS7) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [07:40:13] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42468/console" [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [07:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:08:14] (03CR) 10Vgutierrez: "applying this for a single host didn't require duplicating all the tls config, and it should be the same for the whole DC" [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:10:26] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10dcaro) [08:11:51] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10dcaro) [08:15:34] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) @Arnoldokoth thanks for testing this. You are talking about the your individual profile settings in https://gitlab-replica.... 
[08:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:24:13] (03CR) 10Fabfur: [V: 03+1] hiera: apply silent-drop on port 80 to all eqsin cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:26:13] (03PS8) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [08:27:04] (03CR) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:28:44] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42469/console" [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:34:15] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10SLyngshede-WMF) @Jelto No not yet, and I just tried to login still the same error. I can see how it might work if the accounts are... 
[08:38:55] (03PS9) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [08:40:25] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42470/console" [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:47:00] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [08:47:36] <_joe_> !log deploying to mw on k8s for T341825 [08:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:39] T341825: Max upload size on k8s is 2M - https://phabricator.wikimedia.org/T341825 [08:48:31] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [08:48:46] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [08:51:05] (03PS1) 10Peter Fischer: Bump version of extra plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) [08:51:20] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [08:53:15] (03CR) 10Peter Fischer: "I only bumped the version of the extra plugin (core) for now" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [08:54:18] (03PS1) 10JMeybohm: Fix detection of changes in subcharts [deployment-charts] - 10https://gerrit.wikimedia.org/r/938211 [08:55:21] (03CR) 10CI reject: [V: 04-1] Fix detection of changes in subcharts [deployment-charts] - 10https://gerrit.wikimedia.org/r/938211 (owner: 10JMeybohm) [09:01:18] (03PS1) 10Peter Fischer: Add hint for GPG signatures [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/938212 [09:02:49] !log Setting ores2003 to pooled=inactive while we attempt 
repairs/decide on decom [09:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:54] !log klausman@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=ores2003.codfw.wmnet [09:03:53] (03CR) 10Peter Fischer: "It took a moment for me to figure out the procedure, so I added some documentation for future newbies." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/938212 (owner: 10Peter Fischer) [09:04:26] (03PS2) 10Filippo Giunchedi: blubber: add buildkit syntax directive [alerts] - 10https://gerrit.wikimedia.org/r/937998 (owner: 10Gmodena) [09:04:44] (03CR) 10Filippo Giunchedi: [C: 03+2] blubber: add buildkit syntax directive (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/937998 (owner: 10Gmodena) [09:05:23] (03CR) 10CI reject: [V: 04-1] blubber: add buildkit syntax directive [alerts] - 10https://gerrit.wikimedia.org/r/937998 (owner: 10Gmodena) [09:05:41] :( [09:06:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Fix detection of changes in subcharts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938211 (owner: 10JMeybohm) [09:06:19] (03CR) 10Filippo Giunchedi: [C: 03+2] "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/937998 (owner: 10Gmodena) [09:07:23] (03Merged) 10jenkins-bot: blubber: add buildkit syntax directive [alerts] - 10https://gerrit.wikimedia.org/r/937998 (owner: 10Gmodena) [09:13:17] (03PS16) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [09:17:13] (03CR) 10JMeybohm: Testing hack: Update ipoid to certmanager (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:17:29] (03CR) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) 
(owner: 10JMeybohm) [09:21:34] (03CR) 10Btullis: [C: 03+1] Fix detection of changes in subcharts [deployment-charts] - 10https://gerrit.wikimedia.org/r/938211 (owner: 10JMeybohm) [09:25:36] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] "Force pass CI as this will fail until datahub is fixed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/938211 (owner: 10JMeybohm) [09:31:12] (03PS1) 10JMeybohm: CI: Run envoy validation with service-node and service-cluster set [deployment-charts] - 10https://gerrit.wikimedia.org/r/938213 (https://phabricator.wikimedia.org/T300033) [09:32:11] (03CR) 10Ayounsi: Manage TLS on network devices (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [09:32:13] (03CR) 10CI reject: [V: 04-1] CI: Run envoy validation with service-node and service-cluster set [deployment-charts] - 10https://gerrit.wikimedia.org/r/938213 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:32:42] (03PS3) 10JMeybohm: Prepare for new helm module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/937956 (https://phabricator.wikimedia.org/T300033) [09:32:44] (03PS4) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) [09:32:46] (03PS4) 10JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) [09:32:48] (03PS4) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [09:33:04] (03PS17) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [09:33:45] (03CR) 10CI reject: [V: 04-1] Prepare for new helm module versions [deployment-charts] - 
10https://gerrit.wikimedia.org/r/937956 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:33:47] (03CR) 10CI reject: [V: 04-1] Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:33:53] (03CR) 10CI reject: [V: 04-1] Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:33:56] (03CR) 10CI reject: [V: 04-1] Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:34:49] (03CR) 10Jbond: [C: 03+1] dnsrecursor: allow configuring the webserver loglevel [puppet] - 10https://gerrit.wikimedia.org/r/937991 (https://phabricator.wikimedia.org/T341611) (owner: 10Ssingh) [09:35:31] (03CR) 10CI reject: [V: 04-1] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [09:36:28] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] "Force pass CI as this will fail until datahub is fixed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/938213 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:39:55] (03PS5) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) [09:39:57] (03PS5) 10JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) [09:39:59] (03PS5) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [09:40:01] (03PS1) 10Btullis: Add missing global values to 
the datahub subcharts to fix CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/938214 (https://phabricator.wikimedia.org/T329514) [09:41:46] (03PS18) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [09:44:00] (03CR) 10CI reject: [V: 04-1] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [09:45:03] (03CR) 10JMeybohm: "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/938214 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [09:45:11] (03CR) 10JMeybohm: [C: 03+1] Add missing global values to the datahub subcharts to fix CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/938214 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [09:53:01] (03PS19) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [09:55:20] (03CR) 10CI reject: [V: 04-1] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [09:57:38] (03PS20) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [09:59:59] (03CR) 10CI reject: [V: 04-1] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [10:07:36] (03PS21) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [10:08:16] (03CR) 10JMeybohm: [WIP] wikifunctions: Add network ability for orchestrator to talk to evaluator (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937972 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [10:09:29] (03PS10) 
10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [10:09:56] (03CR) 10CI reject: [V: 04-1] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [10:12:53] (03CR) 10Alexandros Kosiaris: [C: 03+1] thumbor: Bye bye nutcracker! [deployment-charts] - 10https://gerrit.wikimedia.org/r/938001 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [10:14:53] (03PS22) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [10:17:02] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 10 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42471/console" [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:17:12] (03CR) 10CI reject: [V: 04-1] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [10:17:28] (03CR) 10Effie Mouzeli: [C: 03+1] images: add debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/937985 (https://phabricator.wikimedia.org/T341805) (owner: 10Hnowlan) [10:17:59] (03PS1) 10Jbond: pki: Add new network devices CA [puppet] - 10https://gerrit.wikimedia.org/r/938217 [10:18:22] (03CR) 10CI reject: [V: 04-1] pki: Add new network devices CA [puppet] - 10https://gerrit.wikimedia.org/r/938217 (owner: 10Jbond) [10:20:47] (03PS2) 10Jbond: pki: Add new network devices CA [puppet] - 10https://gerrit.wikimedia.org/r/938217 [10:21:08] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) >>! 
In T341488#9012959, @MatthewVernon wrote: > It might not be possible, but if we could end up with the `thanos-fe*` nodes running `swift:... [10:21:13] (03PS5) 10David Caro: wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231 [10:21:15] (03PS7) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) [10:21:17] (03CR) 10David Caro: replica_cnf_api: refactor to use multiple backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [10:21:19] (03PS5) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [10:21:27] (03PS11) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [10:22:11] (03CR) 10Jbond: [C: 03+2] pki: Add new network devices CA [puppet] - 10https://gerrit.wikimedia.org/r/938217 (owner: 10Jbond) [10:24:08] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#9013800, @Papaul wrote: > @cmooney I am ok moving the server when it is ready. We can move it to... 
[10:25:13] (03CR) 10Btullis: [C: 03+2] Add missing global values to the datahub subcharts to fix CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/938214 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:25:20] (03CR) 10CI reject: [V: 04-1] hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:26:08] (03Merged) 10jenkins-bot: Add missing global values to the datahub subcharts to fix CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/938214 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:28:22] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:37] (03PS1) 10Jbond: pki: add network devices CA [puppet] - 10https://gerrit.wikimedia.org/r/938218 (https://phabricator.wikimedia.org/T334594) [10:35:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42474/console" [puppet] - 10https://gerrit.wikimedia.org/r/938218 (https://phabricator.wikimedia.org/T334594) (owner: 10Jbond) [10:38:01] (03CR) 10Jbond: pki: add network devices CA [puppet] - 10https://gerrit.wikimedia.org/r/938218 (https://phabricator.wikimedia.org/T334594) (owner: 10Jbond) [10:38:04] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [10:38:42] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [10:39:13] (03PS8) 10Kamila Součková: add Benthos chart + WIP cache invalidator service [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) [10:39:43] (03PS12) 10Fabfur: hiera: apply silent-drop on port 80 to all 
eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [10:40:01] (03CR) 10CI reject: [V: 04-1] add Benthos chart + WIP cache invalidator service [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [10:40:36] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [10:41:12] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [10:42:31] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [10:43:19] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [10:43:29] (03CR) 10Kaleem Bhatti: [C: 03+1] sdwiki: set 'wgTranslateNumerals' to false followed by https://phabricator.wikimedia.org/T296055 Bug: T268203 Change-Id: I2bac799647ac0c0c58 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [10:44:04] (03CR) 10Vgutierrez: [C: 03+1] hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:45:51] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 9 CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42475/console" [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:53:52] (03PS23) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [11:04:14] (03PS9) 10Kamila Součková: add Benthos chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) [11:04:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:17] (03PS1) 10Cathal Mooney: admin: add Ifrah Khanyaree (WMDE) to LDAP-only admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/938222 (https://phabricator.wikimedia.org/T341455) [11:16:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:55] (03CR) 10Ayounsi: [C: 03+1] "I don't know enough how that works for through review but seems straightforward. and +1 on the 1y and the name" [puppet] - 10https://gerrit.wikimedia.org/r/938218 (https://phabricator.wikimedia.org/T334594) (owner: 10Jbond) [11:19:14] (03PS24) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [11:20:08] (03CR) 10Ayounsi: Manage TLS on network devices (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [11:27:58] (03PS14) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [11:28:34] (03CR) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [11:34:59] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Reedy) [11:36:07] (03PS1) 10ArielGlenn: add jebe and xcollazo to nagios command access [puppet] - 10https://gerrit.wikimedia.org/r/938226 (https://phabricator.wikimedia.org/T341045) [11:37:22] (03PS15) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) 
[11:38:26] (03CR) 10JMeybohm: [C: 04-1] confd: allow running multiple instances (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [11:39:21] (03CR) 10Hnowlan: [C: 03+1] add Benthos chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [11:41:33] (03CR) 10Hnowlan: [C: 03+2] images: add debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/937985 (https://phabricator.wikimedia.org/T341805) (owner: 10Hnowlan) [11:50:12] (03Merged) 10jenkins-bot: images: add debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/937985 (https://phabricator.wikimedia.org/T341805) (owner: 10Hnowlan) [11:51:47] (03PS16) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [11:58:06] (03PS17) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [11:58:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I think we have a pretty solid test coverage, and the generated ruleset should be both efficient and elegant for now, until we start looki" [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [12:00:46] (03PS18) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [12:01:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [12:01:49] (03PS4) 10Ayounsi: Add python3-cryptography to cookbooks 
[puppet] - 10https://gerrit.wikimedia.org/r/937510 (https://phabricator.wikimedia.org/T334594) [12:03:55] (03PS1) 10Hnowlan: thumbor: enable debug logging for memcached [deployment-charts] - 10https://gerrit.wikimedia.org/r/938227 (https://phabricator.wikimedia.org/T341805) [12:06:01] (03CR) 10Ayounsi: [C: 03+2] Add python3-cryptography to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/937510 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [12:06:51] (03CR) 10JMeybohm: add Benthos chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [12:09:55] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) >>! In T341494#9009786, @cmooney wrote: > > Unless you’re considering it like a cluster of 5, and how to spread over the 4... [12:10:25] (03PS1) 10Filippo Giunchedi: udp2log: run mw-log-cleanup after logrotate [puppet] - 10https://gerrit.wikimedia.org/r/938228 (https://phabricator.wikimedia.org/T341691) [12:12:18] (03CR) 10JMeybohm: add Benthos chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [12:16:04] (03CR) 10Kamila Součková: add Benthos chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [12:25:31] (03CR) 10JMeybohm: add Benthos chart (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [12:28:33] 10SRE, 10ops-codfw, 10decommission-hardware: decommission krb2001.codfw.wmnet - https://phabricator.wikimedia.org/T340433 (10Jhancock.wm) a:05Papaul→03Jhancock.wm [12:36:52] (03Abandoned) 10Ayounsi: Spicerack: add some colors 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/924493 (owner: 10Ayounsi) [12:41:19] (03PS1) 10Arturo Borrero Gonzalez: eqiad1: decomission cloudcontrol1005.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/938235 (https://phabricator.wikimedia.org/T341495) [12:41:22] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/931609 (owner: 10Ayounsi) [12:42:37] (03CR) 10Ayounsi: [C: 03+2] Remove trusted-space [homer/public] - 10https://gerrit.wikimedia.org/r/931609 (owner: 10Ayounsi) [12:43:14] (03Merged) 10jenkins-bot: Remove trusted-space [homer/public] - 10https://gerrit.wikimedia.org/r/931609 (owner: 10Ayounsi) [13:00:19] (03PS1) 10Arturo Borrero Gonzalez: wikireplicas: relocate some hardcoded data into hiera [puppet] - 10https://gerrit.wikimedia.org/r/938238 [13:02:09] (03PS2) 10Jbond: pki: add network devices CA [puppet] - 10https://gerrit.wikimedia.org/r/938218 (https://phabricator.wikimedia.org/T334594) [13:03:04] (03PS3) 10Jbond: pki: add network devices CA [puppet] - 10https://gerrit.wikimedia.org/r/938218 (https://phabricator.wikimedia.org/T334594) [13:03:44] (03CR) 10Jbond: "ill merge this on monday" [puppet] - 10https://gerrit.wikimedia.org/r/938218 (https://phabricator.wikimedia.org/T334594) (owner: 10Jbond) [13:05:14] (03PS2) 10Jelto: buildkitd: Fix gckeepstorage units [puppet] - 10https://gerrit.wikimedia.org/r/938016 (https://phabricator.wikimedia.org/T340887) (owner: 10Dduvall) [13:06:33] (03PS2) 10Arturo Borrero Gonzalez: wikireplicas: relocate some hardcoded data into hiera [puppet] - 10https://gerrit.wikimedia.org/r/938238 [13:06:56] (03CR) 10CI reject: [V: 04-1] wikireplicas: relocate some hardcoded data into hiera [puppet] - 10https://gerrit.wikimedia.org/r/938238 (owner: 10Arturo Borrero Gonzalez) [13:07:46] (03PS3) 10Arturo Borrero Gonzalez: wikireplicas: relocate some hardcoded data into hiera [puppet] - 10https://gerrit.wikimedia.org/r/938238 [13:10:29] (03PS3) 10Jelto: buildkitd: Fix 
gckeepstorage units [puppet] - 10https://gerrit.wikimedia.org/r/938016 (https://phabricator.wikimedia.org/T340887) (owner: 10Dduvall) [13:12:23] (03PS4) 10Arturo Borrero Gonzalez: wikireplicas: relocate some hardcoded data into hiera [puppet] - 10https://gerrit.wikimedia.org/r/938238 [13:13:57] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42481/console" [puppet] - 10https://gerrit.wikimedia.org/r/938016 (https://phabricator.wikimedia.org/T340887) (owner: 10Dduvall) [13:14:15] (03PS5) 10Arturo Borrero Gonzalez: wikireplicas: relocate some hardcoded data into hiera [puppet] - 10https://gerrit.wikimedia.org/r/938238 [13:15:37] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/938238/42482/" [puppet] - 10https://gerrit.wikimedia.org/r/938238 (owner: 10Arturo Borrero Gonzalez) [13:16:36] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm, I fixed two minor puppet bugs." [puppet] - 10https://gerrit.wikimedia.org/r/938016 (https://phabricator.wikimedia.org/T340887) (owner: 10Dduvall) [13:17:13] (03CR) 10Jelto: [V: 03+1 C: 03+2] buildkitd: Fix gckeepstorage units [puppet] - 10https://gerrit.wikimedia.org/r/938016 (https://phabricator.wikimedia.org/T340887) (owner: 10Dduvall) [13:23:01] (03CR) 10Jelto: [V: 03+1 C: 03+2] "diff on wmcs runners:" [puppet] - 10https://gerrit.wikimedia.org/r/938016 (https://phabricator.wikimedia.org/T340887) (owner: 10Dduvall) [13:30:04] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10SLyngshede-WMF) For OIDC via CAS in our Python applications we're relying on a special OIDC backend https://github.com/python-soci... 
[13:30:42] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (10ayounsi) [13:30:48] 10SRE, 10Infrastructure-Foundations, 10netops: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi) [13:34:33] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10SLyngshede-WMF) The fixes that worked a little, but failed to provide correct user information was to set the following in CAS:... [13:34:59] PROBLEM - puppet last run on idp-test1002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:40:29] RECOVERY - puppet last run on idp-test1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:46:19] (03PS1) 10Elukey: ml-services: increase min/max pods for revscoring damaging and goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/938241 (https://phabricator.wikimedia.org/T341479) [13:52:17] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: increase min/max pods for revscoring damaging and goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/938241 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [13:53:41] (03PS2) 10Elukey: ml-services: increase min/max pods for revscoring damaging and goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/938241 (https://phabricator.wikimedia.org/T341479) [13:55:16] (03CR) 10Elukey: "Ilias I reworked a bit the change, lemme know if you still like it!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/938241 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [14:07:45] (03CR) 10Kamila Součková: add Benthos chart (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [14:08:23] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:14:37] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: increase min/max pods for revscoring damaging and goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/938241 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [14:15:18] (03CR) 10JMeybohm: add Benthos chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [14:16:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:17:09] 10SRE-swift-storage, 10Commons: Uploading large files to Commons almost always fails - https://phabricator.wikimedia.org/T340901 (10Hoi) Sometimes, after failed uploads, I got `[...] 202?-??-?? ??:??:??: Fatal exception of type "Wikimedia\RequestTimeout\RequestTimeoutException` when accessing [[c:Special:Uploa... 
[14:17:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 1.081 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:18:22] (03CR) 10Elukey: [C: 03+2] ml-services: increase min/max pods for revscoring damaging and goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/938241 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [14:18:23] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:25:24] (03PS10) 10Kamila Součková: add Benthos chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) [14:25:26] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[14:26:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:26:42] (03CR) 10Kamila Součková: add Benthos chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [14:31:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:33:22] (03PS1) 10Elukey: ml-services: set scaling defaults for goodfaith/damaing in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938251 [14:39:20] (03PS1) 10Elukey: admin_ng: set better resourcequotas for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/938252 [14:39:48] (03CR) 10Elukey: [C: 03+2] ml-services: set scaling defaults for goodfaith/damaing in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938251 (owner: 10Elukey) [14:42:07] (03CR) 10JHathaway: [C: 03+1] "seems reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/936691 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [14:44:20] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [14:47:13] (03CR) 10Kamila Součková: [C: 03+2] add Benthos chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [14:51:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:52:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:52:22] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:59:23] (03CR) 10Ebernhardson: [V: 03+2 C: 03+2] Add hint for GPG signatures [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/938212 (owner: 10Peter Fischer) [15:02:15] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/938238 (owner: 10Arturo Borrero Gonzalez) [15:03:32] (03PS1) 10Kamila Součková: add WIP Benthos cache invalidator to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) [15:04:11] (03CR) 10CI reject: [V: 04-1] add WIP Benthos cache invalidator to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [15:05:28] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10cmooney) p:05Triage→03Medium a:03cmooney [15:06:20] (03PS2) 10Kamila Součková: add WIP Benthos cache invalidator to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) [15:15:35] 10SRE, 10Platform Engineering, 10Traffic, 10Wikimedia Enterprise: Securely connect Wikimedia Enterprise 
Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (10JArguello-WMF) [15:17:08] (03CR) 10Klausman: [C: 03+1] admin_ng: set better resourcequotas for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/938252 (owner: 10Elukey) [15:23:11] (03CR) 10Ilias Sarantopoulos: [C: 03+1] admin_ng: set better resourcequotas for ml-serve clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938252 (owner: 10Elukey) [15:26:45] (03CR) 10Elukey: admin_ng: set better resourcequotas for ml-serve clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938252 (owner: 10Elukey) [15:27:05] going afk for the weekend folks! [15:27:13] have a good rest of the day [15:27:16] (and weekend) [15:27:56] (03CR) 10Ilias Sarantopoulos: [C: 03+1] admin_ng: set better resourcequotas for ml-serve clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938252 (owner: 10Elukey) [15:33:23] (03PS1) 10Hnowlan: rest-gateway: add routes for wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/938265 (https://phabricator.wikimedia.org/T339119) [15:34:02] (03CR) 10CI reject: [V: 04-1] rest-gateway: add routes for wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/938265 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [15:38:52] (03PS2) 10Hnowlan: rest-gateway: add routes for wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/938265 (https://phabricator.wikimedia.org/T339119) [15:45:24] (03CR) 10Hnowlan: [C: 03+2] thumbor: enable debug logging for memcached [deployment-charts] - 10https://gerrit.wikimedia.org/r/938227 (https://phabricator.wikimedia.org/T341805) (owner: 10Hnowlan) [15:46:08] (03Merged) 10jenkins-bot: thumbor: enable debug logging for memcached [deployment-charts] - 10https://gerrit.wikimedia.org/r/938227 (https://phabricator.wikimedia.org/T341805) (owner: 10Hnowlan) [15:46:15] (03PS3) 10Kamila Součková: add WIP Benthos 
cache invalidator to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) [15:48:38] (03PS4) 10Kamila Součková: add WIP Benthos cache invalidator to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) [15:49:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "merging this next week if nobody beats me to it." [puppet] - 10https://gerrit.wikimedia.org/r/938238 (owner: 10Arturo Borrero Gonzalez) [15:59:08] 10SRE, 10ops-codfw, 10decommission-hardware: decommission krb2001.codfw.wmnet - https://phabricator.wikimedia.org/T340433 (10Jhancock.wm) [16:00:05] 10SRE, 10ops-codfw, 10decommission-hardware: decommission krb2001.codfw.wmnet - https://phabricator.wikimedia.org/T340433 (10Jhancock.wm) 05Open→03Resolved disk removed, moved to storage, and offline script run [16:04:49] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:05:05] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:08:54] (03CR) 10Dzahn: [C: 03+1] "lgmt, only nitpick is that they dont have email address set in LDAP. 
but I can confirm this follows the standard schema of WMDE email addr" [puppet] - 10https://gerrit.wikimedia.org/r/938222 (https://phabricator.wikimedia.org/T341455) (owner: 10Cathal Mooney) [16:18:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:19:40] (03PS1) 10Hnowlan: images: fix debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/938272 [16:23:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:26:40] 10SRE: Cannot download large files from commons - https://phabricator.wikimedia.org/T341755 (10Midleading) These files are uploaded at the request of another Wikimedia user. I have PDFs up to 8GB, but they will be splitted to parts below 4GB. Not having to use server-side upload is a huge achievement for Wikimed... 
[16:27:05] 10SRE: Cannot download large files from commons - https://phabricator.wikimedia.org/T341755 (10Midleading) p:05Triage→03Low [16:28:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:30:00] (03PS1) 10Hnowlan: Revert "thumbor: enable debug logging for memcached" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937924 [16:31:50] (03CR) 10Hnowlan: [C: 03+2] Revert "thumbor: enable debug logging for memcached" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937924 (owner: 10Hnowlan) [16:32:45] (03Merged) 10jenkins-bot: Revert "thumbor: enable debug logging for memcached" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937924 (owner: 10Hnowlan) [16:33:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:04:56] (03PS10) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 [17:06:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:11:13] 
RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:11:27] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:27:44] (03PS11) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 [17:31:04] (03CR) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [17:37:35] PROBLEM - Check systemd state on cloudbackup2001 is CRITICAL: CRITICAL - degraded: The following units failed: dm-event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:21] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:41] (03PS1) 10Urbanecm: NewImpact: fix undefined log function [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938306 (https://phabricator.wikimedia.org/T341865) [18:13:53] (03PS1) 10Majavah: build: Fix printed image name [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938291 [18:13:59] (03PS1) 10Majavah: Add php82 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938292 (https://phabricator.wikimedia.org/T335352) [18:18:23] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:30:06] (03PS1) 10Jforrester: [WIP] service, 
k8s: Add service definitions for function-orchestrator and function-evaluator [puppet] - 10https://gerrit.wikimedia.org/r/938295 (https://phabricator.wikimedia.org/T297314) [18:30:13] (03CR) 10Ebernhardson: [C: 03+1] "lgtm" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [18:30:17] (03CR) 10Jforrester: [WIP] wikifunctions: Add network ability for orchestrator to talk to evaluator (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937972 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [18:30:30] (03CR) 10CI reject: [V: 04-1] [WIP] service, k8s: Add service definitions for function-orchestrator and function-evaluator [puppet] - 10https://gerrit.wikimedia.org/r/938295 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [18:42:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1153.eqiad.wmnet with OS bullseye [18:42:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1153.eqiad.wmnet with OS bullseye [18:43:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:48:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:19:17] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@37d3ad6]: Run 
page_content_change_to_wikitext_raw DAG serially. T335860 [19:19:21] T335860: Implement job to transform mediawiki.page_content_change - https://phabricator.wikimedia.org/T335860 [19:19:32] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@37d3ad6]: Run page_content_change_to_wikitext_raw DAG serially. T335860 (duration: 00m 14s) [19:38:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) @Jclark-ctr I need your assistance next time you're onsite at Eqiad. These servers do not have a network connection on the 1st port of the NIC card. Can yo... [19:39:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host an-worker1153.eqiad.wmnet with OS bullseye [19:52:12] (03PS1) 10Jforrester: wikifunctions: Add ENV to set Host: header in orchestrator MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/938299 [19:52:47] (03CR) 10CI reject: [V: 04-1] wikifunctions: Add ENV to set Host: header in orchestrator MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/938299 (owner: 10Jforrester) [19:52:50] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Add ENV to set Host: header in orchestrator MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/938299 (owner: 10Jforrester) [19:53:17] (03CR) 10Cory Massaro: [C: 03+2] wikifunctions: Add ENV to set Host: header in orchestrator MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/938299 (owner: 10Jforrester) [19:53:25] (03CR) 10CI reject: [V: 04-1] wikifunctions: Add ENV to set Host: header in orchestrator MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/938299 (owner: 10Jforrester) [19:53:33] (03PS2) 10Jforrester: wikifunctions: Add ENV to set Host: header in orchestrator MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/938299 [19:54:39] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Add ENV to 
set Host: header in orchestrator MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/938299 (owner: 10Jforrester) [19:55:17] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:55:19] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:55:29] (03Merged) 10jenkins-bot: wikifunctions: Add ENV to set Host: header in orchestrator MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/938299 (owner: 10Jforrester) [19:56:40] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:57:27] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:02:07] (03PS1) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 [20:05:38] (03CR) 10Cory Massaro: [C: 03+1] wikifunctions: Attempt to write out our main config as JSON (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester) [20:31:35] (03CR) 10Daniel Kinzler: "This should only be deployed opnce we are sure we are not reverting the code change. Please give it a couple of weeks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 (owner: 10D3r1ck01) [21:07:01] (03CR) 10BryanDavis: "One note inline about a config line that dropped that is not obviously unwanted and another about a thing that should be documented somewh" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938292 (https://phabricator.wikimedia.org/T335352) (owner: 10Majavah) [21:13:09] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10Dzahn) So.. meanwhile the number of services still using 2.2 syntax has been reduced. And the one that actually affects appservers/mediawiki.. I had a p... 
[21:21:43] (03CR) 10BryanDavis: build: Fix printed image name (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938291 (owner: 10Majavah) [21:48:24] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:18:23] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:45:30] (03PS1) 10Cwhite: logstash: remove pybal log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) [22:45:32] (03PS1) 10Cwhite: logstash: remove haproxy log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937601 (https://phabricator.wikimedia.org/T234565) [22:45:34] (03PS1) 10Cwhite: logstash: remove grafana log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565) [22:45:36] (03PS1) 10Cwhite: logstash: remove k8s stats-exporter cloning [puppet] - 10https://gerrit.wikimedia.org/r/937603 (https://phabricator.wikimedia.org/T234565) [22:45:38] (03PS1) 10Cwhite: logstash: remove thanos log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937604 (https://phabricator.wikimedia.org/T234565) [22:45:40] (03PS1) 10Cwhite: logstash: restore program field to node logs [puppet] - 10https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) [22:45:42] (03PS1) 10Cwhite: logstash: remove node log cloning [puppet] - 10https://gerrit.wikimedia.org/r/938326 (https://phabricator.wikimedia.org/T234565) [22:48:52] (03CR) 10CI reject: [V: 04-1] logstash: remove pybal log cloning [puppet] - 
10https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:54:26] (03CR) 10CI reject: [V: 04-1] logstash: restore program field to node logs [puppet] - 10https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:54:42] (03CR) 10Cwhite: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [23:00:10] (03PS2) 10Cwhite: logstash: restore program field to node logs [puppet] - 10https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) [23:10:38] 10SRE, 10Data-Platform-SRE, 10SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10BTullis) [23:20:15] (03CR) 10Cwhite: [C: 03+1] udp2log: run mw-log-cleanup after logrotate [puppet] - 10https://gerrit.wikimedia.org/r/938228 (https://phabricator.wikimedia.org/T341691) (owner: 10Filippo Giunchedi) [23:23:32] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10observability, and 2 others: Kafka 2.x Upgrade Plan - https://phabricator.wikimedia.org/T302610 (10BTullis) [23:35:30] (03PS43) 10Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - 10https://gerrit.wikimedia.org/r/648385 [23:37:44] (03PS44) 10Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - 10https://gerrit.wikimedia.org/r/648385 [23:42:42] (03CR) 10Dzahn: "it's close now, just one special case with the docker-registry test and that the "purge" is dropped when using mkdir_p. 
see here:" [puppet] - 10https://gerrit.wikimedia.org/r/648385 (owner: 10Dzahn) [23:44:10] (03CR) 10Dzahn: [C: 03+2] admin: remove old ssh key from user dzahn [puppet] - 10https://gerrit.wikimedia.org/r/934634 (owner: 10Dzahn) [23:44:16] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10BTullis) [23:48:38] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 3 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10BTullis) [23:50:14] I have successfully revoked my own access for now. Will be back in late October. cu all and /quit :) [23:53:57] 10SRE, 10Data-Platform-SRE, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10BTullis) [23:59:19] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 4 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10BTullis)