[00:05:06] (03CR) 10Dzahn: "needed manual rebase, meanwhile 6098 has been taken by devtools. so had to use 6099 and 6100" [puppet] - 10https://gerrit.wikimedia.org/r/828057 (owner: 10Chad) [00:06:46] (03PS2) 10Dzahn: codesearch: configure ports for design and discovery [puppet] - 10https://gerrit.wikimedia.org/r/828057 (owner: 10Chad) [00:18:21] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:33:49] (03CR) 10Dzahn: [C: 03+2] codesearch: configure ports for design and discovery [puppet] - 10https://gerrit.wikimedia.org/r/828057 (owner: 10Chad) [00:36:18] (03CR) 10Dzahn: [C: 03+2] "deployed on codesearch8: Notice: /Stage[main]/Codesearch/Systemd::Service[hound_proxy]/Service[hound_proxy]: Triggered 'refresh' from 2 ev" [puppet] - 10https://gerrit.wikimedia.org/r/828057 (owner: 10Chad) [00:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/936820 [00:38:46] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/936820 (owner: 10TrainBranchBot) [00:46:58] (03CR) 10Dzahn: [C: 03+2] "I see now that docker-proxy is listening on 6099, not on 6100 though." 
[puppet] - 10https://gerrit.wikimedia.org/r/828057 (owner: 10Chad) [00:54:03] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/936820 (owner: 10TrainBranchBot) [01:20:34] (HelmReleaseBadStatus) firing: Helm release opentelemetry-collector/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:41:01] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2013 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:45:29] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:46:58] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 6 days, 19:00:00 on wdqs[2013,2022].codfw.wmnet with reason: new host [01:47:12] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 19:00:00 on wdqs[2013,2022].codfw.wmnet with reason: new host [02:00:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:05:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:08:21] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:45] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:49] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:28:21] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:29:20] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:06] (03PS1) 10RLazarus: opentelemetry-collector: Allow disabling liveness and readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937206 [02:42:08] (03PS1) 10RLazarus: opentelemetry-collector: Temporarily disable probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937207 [02:42:12] (LVSHighRX) firing: Excessive RX traffic on lvs1019:9100 (eno1np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - 
https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [02:42:50] looking [02:44:40] rzl: looking as well [02:44:44] 👋 [02:45:45] nothing jumping out at me yet in webrequest data [02:46:21] 👋 [02:46:39] wait, lvs *rx*, never mind, of course not webrequest data [02:47:12] (LVSHighRX) resolved: Excessive RX traffic on lvs1019:9100 (eno1np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [02:48:06] rzl: why not webrequest data? [02:50:46] is turnilo working? [02:50:49] excessive RX is typically an incoming volumetric flood, right? rather than requests for a heavy file or whatever [02:51:16] nothing is standing out to me in the turnilo netflow graphs but the graphs seem to be working [02:51:30] wmf_netflow? [02:51:40] yeah [02:51:59] ok, must be a user error of some sort :( [02:52:47] if you followed the deep link from the ddos playbook, that didn't deep link correctly, I had to futz with the parameters [02:52:58] try dragging "time" onto "split" to get a graph out of it [02:54:12] yup just got that, thanks [02:54:56] either way nothing that correlates with the spike in the prometheus data from lvs1019 [02:55:54] I'm inclined to leave it there unless anything else happens [02:56:13] sun spots. 
[02:56:15] I see the spike in host overview on grafana for lvs1019 [02:56:26] yeah [02:56:55] but I concur, let's wait and see if it pops again [03:03:13] (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Allow disabling liveness and readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937206 (owner: 10RLazarus) [03:04:08] (03Merged) 10jenkins-bot: opentelemetry-collector: Allow disabling liveness and readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937206 (owner: 10RLazarus) [03:06:18] (03PS2) 10RLazarus: opentelemetry-collector: Temporarily disable probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937207 [03:06:20] (03PS1) 10RLazarus: opentelemetry-collector: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/937208 [03:07:46] (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/937208 (owner: 10RLazarus) [03:08:34] (03Merged) 10jenkins-bot: opentelemetry-collector: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/937208 (owner: 10RLazarus) [03:10:37] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2013 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:14:15] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [03:14:28] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [03:15:34] (HelmReleaseBadStatus) resolved: Helm release opentelemetry-collector/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=opentelemetry-collector - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [03:15:55] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [03:16:12] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [03:29:53] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [03:29:59] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [03:34:02] (03PS1) 10RLazarus: Revert "opentelemetry-collector: Allow disabling liveness and readiness probes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937210 [03:34:04] (03PS1) 10RLazarus: opentelemetry-collector: Specify otlp receiver only in pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/937211 [03:34:13] (03Abandoned) 10RLazarus: opentelemetry-collector: Temporarily disable probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937207 (owner: 10RLazarus) [03:35:18] (03CR) 10RLazarus: [C: 03+2] Revert "opentelemetry-collector: Allow disabling liveness and readiness probes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937210 (owner: 10RLazarus) [03:36:02] (03Merged) 10jenkins-bot: Revert "opentelemetry-collector: Allow disabling liveness and readiness probes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937210 (owner: 10RLazarus) [03:38:26] (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Specify otlp receiver only in pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/937211 (owner: 10RLazarus) [03:39:10] (03Merged) 10jenkins-bot: opentelemetry-collector: Specify otlp receiver only in pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/937211 (owner: 10RLazarus) [03:39:31] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [03:39:53] PROBLEM - PHP opcache health on parse1015 is CRITICAL: CRITICAL: opcache full on php 7.4. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:41:07] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [03:41:19] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [03:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [04:08:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:08:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:11:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50277 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:14:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:19:39] (03CR) 10TChin: [C: 03+2] mw-page-content-change-enrich bump docker version [deployment-charts] - 10https://gerrit.wikimedia.org/r/937145 (https://phabricator.wikimedia.org/T338169) (owner: 10TChin) [04:19:49] (03CR) 10TChin: [C: 03+2] Bump stream versions in mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/934719 (https://phabricator.wikimedia.org/T340746) (owner: 10TChin) [04:20:21] (03Merged) 10jenkins-bot: mw-page-content-change-enrich bump docker version [deployment-charts] - 10https://gerrit.wikimedia.org/r/937145 (https://phabricator.wikimedia.org/T338169) (owner: 10TChin) [04:20:33] (03Merged) 10jenkins-bot: Bump stream versions in 
mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/934719 (https://phabricator.wikimedia.org/T340746) (owner: 10TChin) [04:32:59] RECOVERY - PHP opcache health on mw1467 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:35:16] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341648 (10phaultfinder) [04:35:19] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-July-September): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10KartikMistry) @akosiaris What should be the next step for this task? [04:35:58] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [04:36:31] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T0600) [06:06:14] (03PS1) 10Muehlenhoff: Remove expiry date for ppenloglu [puppet] - 10https://gerrit.wikimedia.org/r/937347 [06:06:51] !log ayounsi@cumin1001 START - Cookbook sre.network.tls [06:06:51] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) [06:09:12] (03CR) 10Muehlenhoff: [C: 03+2] Remove expiry date for ppenloglu [puppet] - 10https://gerrit.wikimedia.org/r/937347 (owner: 10Muehlenhoff) [06:10:35] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2013 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:11:52] !log ayounsi@cumin1001 START - Cookbook sre.network.tls [06:11:52] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) [06:12:04] (03PS1) 10Muehlenhoff: Update email address [puppet] - 
10https://gerrit.wikimedia.org/r/937348 [06:16:01] (03CR) 10Muehlenhoff: [C: 03+2] Update email address [puppet] - 10https://gerrit.wikimedia.org/r/937348 (owner: 10Muehlenhoff) [06:16:37] !log ayounsi@cumin1001 START - Cookbook sre.network.tls [06:16:37] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) [06:18:07] volans: can I disable IRC logging when using the test-cookbook stuff? otherwise I'm going to flood IRC quite a bit :( [06:18:27] !log ayounsi@cumin1001 START - Cookbook sre.network.tls [06:18:27] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) [06:24:46] (03PS1) 10Abijeet Patro: QueryMessageGroupActionApi: Apply sorting to groups only [extensions/Translate] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937117 (https://phabricator.wikimedia.org/T341627) [06:27:32] XioNoX: in theory you should test with dry-run until it reaches a state where it's mostly working so that real testing doesn't require too many iterations. It's also good to have SAL logging when RW operations are made. 
That said, if you really know what you're doing yes it's possible to disable it, I'll write to you in query the details [06:28:05] volans: I know about dry-run :) thanks [06:29:03] unfortunately my usecase reaches the end of what I can do with dry-run quite quickly, and I'm testing it on a test device [06:29:20] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:30:52] got it [06:37:28] (03PS5) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [06:37:47] !log ayounsi@cumin1001 START - Cookbook sre.network.tls [06:37:47] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) [06:44:27] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10Ifrahkhanyaree) @KFrancis just sent you the email. Thank you! 
[06:45:32] (03PS6) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [06:45:45] !log ayounsi@cumin1001 START - Cookbook sre.network.tls [06:45:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) [06:47:25] !log ayounsi@cumin1001 START - Cookbook sre.network.tls [06:47:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) [06:54:53] (03PS1) 10Volans: test-cookbook: Add --no-sal-logging [puppet] - 10https://gerrit.wikimedia.org/r/937395 [06:55:19] !log ayounsi@cumin1001 START - Cookbook sre.network.tls [06:55:19] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) [06:57:00] (03CR) 10Ayounsi: [C: 03+1] test-cookbook: Add --no-sal-logging [puppet] - 10https://gerrit.wikimedia.org/r/937395 (owner: 10Volans) [07:00:05] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T0700) [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:03:11] (03PS1) 10Elukey: eventgate: set a more performant default for queue.buffering.max.ms [deployment-charts] - 10https://gerrit.wikimedia.org/r/937432 (https://phabricator.wikimedia.org/T338357) [07:03:43] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T341652 (10Fokebox) [07:05:51] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [07:10:35] (03PS1) 10Muehlenhoff: Add frtech folks to run the DNS sync [puppet] - 10https://gerrit.wikimedia.org/r/937433 (https://phabricator.wikimedia.org/T336231) [07:26:22] (03PS7) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [07:26:29] PROBLEM - PHP opcache health on mw1446 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:26:33] PROBLEM - PHP opcache health on mw1439 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:26:41] !log ayounsi@cumin1001 START - Cookbook sre.network.tls [07:26:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) [07:26:49] (03CR) 10Volans: [C: 03+2] test-cookbook: Add --no-sal-logging [puppet] - 10https://gerrit.wikimedia.org/r/937395 (owner: 10Volans) [07:29:35] !log ayounsi@cumin1001 START - Cookbook sre.network.tls [07:29:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) [07:30:15] PROBLEM - PHP opcache health on mw1458 is CRITICAL: CRITICAL: opcache full on php 7.4. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:48:31] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [07:53:00] (03PS1) 10JMeybohm: kubernetes::master: admin.conf on control-plane should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/937434 (https://phabricator.wikimedia.org/T329826) [07:55:00] (03PS2) 10Slyngshede: Forgot username [software/bitu] - 10https://gerrit.wikimedia.org/r/935462 [07:55:13] (03CR) 10Slyngshede: Forgot username (036 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/935462 (owner: 10Slyngshede) [07:56:16] (03PS8) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [08:00:06] dduvall and hashar: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T0800). nyaa~ [08:00:06] jelto and hashar: I, the Bot under the Fountain, call upon thee, The Deployer, to do Continuous Integration server upgrade deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T0800). [08:02:37] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [08:07:05] hashar: I think this is a duplicate deployment calendar entry [08:07:24] (03CR) 10Volans: [C: 03+2] "Thanks for the fixes." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/937185 (owner: 10BCornwall) [08:09:46] (03CR) 10Elukey: "Andrew I am very ignorant about where the "guaranteed" producer is used, I think that setting either 5ms or 10ms is very good in general, " [deployment-charts] - 10https://gerrit.wikimedia.org/r/937432 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [08:11:18] (03Merged) 10jenkins-bot: Add some petty spelling error fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/937185 (owner: 10BCornwall) [08:16:09] (03PS9) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [08:21:41] (03CR) 10Muehlenhoff: sre: update base class with an upgrade action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [08:34:50] (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/937135 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:37:17] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy models for testwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/937438 (https://phabricator.wikimedia.org/T319170) [08:37:51] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: update gitaly prometheus exporter config for gitlab 16 [puppet] - 10https://gerrit.wikimedia.org/r/935753 (https://phabricator.wikimedia.org/T338460) (owner: 10Jelto) [08:39:42] (03CR) 10Ilias Sarantopoulos: "Models have been uploaded on swift" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937438 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [08:43:52] jelto: OOPS [08:43:52] (03PS10) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [08:44:13] looks like I have added the contint switch over to the wrong day :/ [08:45:50] jouncebot: refresh [08:45:51] I refreshed my knowledge about deployments. 
[08:45:53] jouncebot: now [08:45:53] For the next 1 hour(s) and 14 minute(s): Continuous Integration server upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T0800) [08:45:53] For the next 1 hour(s) and 14 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T0800) [08:45:57] yeah it lies [08:46:07] (03CR) 10Muehlenhoff: Move nftables/ferm types to wmflib (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937135 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:46:53] jouncebot: refresh [08:46:55] I refreshed my knowledge about deployments. [08:46:55] jouncebot: now [08:46:56] For the next 1 hour(s) and 13 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T0800) [08:47:08] what? [08:47:19] I made another fix to the wiki page [08:47:21] https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=2091940 [08:47:39] (noticed because the “jump to current event” button still sent me to the contint card even though it was now in yesterday’s section) [08:47:44] AHAH [08:47:47] thank you Lucas_WMDE :) [08:47:49] :) [08:49:53] (03CR) 10Elukey: [C: 03+1] ml-services: deploy models for testwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/937438 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [08:53:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/937433 (https://phabricator.wikimedia.org/T336231) (owner: 10Muehlenhoff) [08:54:14] (03CR) 10Jbond: "in fact i dont think this is needed as they already should have this access via dns-admins" [puppet] - 10https://gerrit.wikimedia.org/r/937433 (https://phabricator.wikimedia.org/T336231) (owner: 10Muehlenhoff) [08:55:33] !log move secondary instances away from ganeti2014 T341546 [08:55:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:38] T341546: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 [08:56:05] (03CR) 10Jbond: "dns1004: sudo -l -U dwisehaupt" [puppet] - 10https://gerrit.wikimedia.org/r/937433 (https://phabricator.wikimedia.org/T336231) (owner: 10Muehlenhoff) [08:57:33] (03CR) 10Muehlenhoff: Add frtech folks to run the DNS sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937433 (https://phabricator.wikimedia.org/T336231) (owner: 10Muehlenhoff) [08:57:41] (03Abandoned) 10Muehlenhoff: Add frtech folks to run the DNS sync [puppet] - 10https://gerrit.wikimedia.org/r/937433 (https://phabricator.wikimedia.org/T336231) (owner: 10Muehlenhoff) [09:03:28] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy models for testwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/937438 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [09:04:10] (03Merged) 10jenkins-bot: ml-services: deploy models for testwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/937438 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [09:07:12] (03CR) 10Jbond: Move nftables/ferm types to wmflib (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937135 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:08:08] (03PS2) 10Slyngshede: Allow users to update their email address. [software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) [09:08:15] (03CR) 10Slyngshede: Allow users to update their email address. 
(035 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) (owner: 10Slyngshede) [09:08:21] (03CR) 10Btullis: [C: 03+2] Configure the test datahub jobs to use the staging schema registry [puppet] - 10https://gerrit.wikimedia.org/r/936792 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [09:11:22] (03CR) 10Superpes15: thwiki: Update logos from commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935876 (https://phabricator.wikimedia.org/T341407) (owner: 10Func) [09:12:15] (03CR) 10Superpes15: "Ops sorry, was an old comment, just publishd by mista" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935876 (https://phabricator.wikimedia.org/T341407) (owner: 10Func) [09:13:29] (03PS11) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [09:15:50] (03PS12) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [09:16:53] (03PS2) 10Slyngshede: Allow users to be created in MediaWiki. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/935376 [09:18:45] (03PS2) 10JMeybohm: kubernetes::master: admin.conf on control-plane should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/937434 (https://phabricator.wikimedia.org/T329826) [09:18:47] (03PS1) 10JMeybohm: cfssl::cert: Add support for notifying multiple services [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) [09:18:49] (03PS1) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [09:19:09] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) > we (I) need to first understand the details of the needs of each other- please ping me... [09:20:01] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T341652 (10Aklapper) 05Open→03Stalled Hi! Per https://wikitech.wikimedia.org/wiki/Maps/External_usage, please fill out **all** fields in the form: **Link to site**: ... **Purpose/details about your project**:... [09:20:44] (03PS1) 10Muehlenhoff: Extend access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/937443 [09:23:01] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T341652 (10Fokebox) Link to site: https://www.wikijournal.org Purpose/details about your project: Implementing maps to articles about tourism and others. Wikimedia Affiliate supporting project: No support from Wiki... [09:23:24] (03CR) 10Slyngshede: Allow users to be created in MediaWiki. 
(033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/935376 (owner: 10Slyngshede) [09:28:33] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/937443 (owner: 10Muehlenhoff) [09:29:45] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10jijiki) [09:32:12] (03PS2) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [09:33:31] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:34:13] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:35:19] (03PS3) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [09:35:23] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [09:35:40] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[09:36:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] labs_lvm: wipe fs signatures when creating logical volume [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) (owner: 10Hashar) [09:37:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] labs_lvm: add `.sh` extension to shell scripts [puppet] - 10https://gerrit.wikimedia.org/r/935422 (owner: 10Hashar) [09:37:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:39:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] labs_lvm: pass shellcheck on scripts [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [09:39:44] (03PS4) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [09:41:24] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42414/console" [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:43:42] (03CR) 10David Caro: "From the help, it seems that you are telling it to not wipe signatures no?" [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) (owner: 10Hashar) [09:44:19] (03CR) 10David Caro: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/935422 (owner: 10Hashar) [09:44:31] (03PS13) 10Ayounsi: [WIP] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [09:45:21] (03CR) 10Hnowlan: [C: 03+1] thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [09:46:18] (03PS6) 10D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 [09:46:34] (03CR) 10Hnowlan: [C: 03+1] "imo we could remove nutcracker with this change also but I don't have a problem with it being two changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [09:47:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:47:59] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (10jbond) [09:49:47] (03CR) 10David Caro: labs_lvm: pass shellcheck on scripts (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [09:50:44] (03CR) 10David Caro: labs_lvm: pass shellcheck on scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [09:53:56] (03PS1) 10Effie Mouzeli: thumbor: helmfile changes for mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937444 (https://phabricator.wikimedia.org/T318695) [09:56:14] (03CR) 10Effie Mouzeli: thumbor: add mcrouter support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [09:56:47] 
(03CR) 10Hashar: labs_lvm: wipe fs signatures when creating logical volume (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) (owner: 10Hashar) [09:57:23] (03PS1) 10Jbond: bacula: update bacula config to trust the pki and puppet ca's [puppet] - 10https://gerrit.wikimedia.org/r/937445 (https://phabricator.wikimedia.org/T341664) [09:57:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:00:06] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T1000) [10:04:15] (03CR) 10David Caro: [C: 03+1] labs_lvm: wipe fs signatures when creating logical volume (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) (owner: 10Hashar) [10:05:13] RECOVERY - PHP opcache health on mw1439 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:05:56] (03CR) 10David Caro: labs_lvm: pass shellcheck on scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [10:09:10] (03CR) 10Hnowlan: thumbor: helmfile changes for mcrouter support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937444 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [10:10:10] (03CR) 10Muehlenhoff: "A few additional smaller comments." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) (owner: 10Slyngshede) [10:10:27] (03PS1) 10Lucas Werkmeister (WMDE): Html: Support more attr types in getTextInputAttributes() [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937121 (https://phabricator.wikimedia.org/T341566) [10:10:37] (03PS8) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [10:11:49] 10ops-codfw, 10Infrastructure-Foundations, 10netops: Upgrade new codfw switches to Juniper recommended - https://phabricator.wikimedia.org/T341670 (10ayounsi) [10:14:29] (03PS14) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [10:21:50] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42419/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:23:30] jouncebot: nowandnext [10:23:30] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T1000) [10:23:30] In 2 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T1300) [10:23:41] (03CR) 10Ladsgroup: [C: 03+2] Externallinks: Keep domain wildcard if path is not specified [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937108 (https://phabricator.wikimedia.org/T326251) (owner: 10Ladsgroup) [10:24:16] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10ayounsi) Could potentially help {T341670} [10:24:30] (03CR) 10Hashar: labs_lvm: pass shellcheck on scripts (037 
comments) [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [10:27:16] (03CR) 10David Caro: labs_lvm: pass shellcheck on scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [10:28:04] (03PS1) 10Ladsgroup: fix: add request headers properly [extensions/ORES] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937122 (https://phabricator.wikimedia.org/T319170) [10:28:20] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (10jcrespo) @Jbond the problem is you keep thinking as using certificates for on the wire encryption and authenticat... [10:28:37] (03CR) 10Jcrespo: [C: 04-1] "See phabricator comment." [puppet] - 10https://gerrit.wikimedia.org/r/937445 (https://phabricator.wikimedia.org/T341664) (owner: 10Jbond) [10:29:21] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:29:37] (03CR) 10Fabfur: [V: 03+1] hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:32:46] (03CR) 10Vgutierrez: [C: 04-1] "puppetization looks good but IMHO we should deploy it first in one or two nodes, so I'd target specific text cluster nodes rather than the" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:33:13] (03PS5) 10Hashar: labs_lvm: wipe fs signatures when creating logical volume [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) [10:33:51] (03PS5) 10Jelto: Run LDAP group sync periodically on active gitlab server 
[puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [10:34:08] (03PS3) 10Hashar: labs_lvm: add `.sh` extension to shell scripts [puppet] - 10https://gerrit.wikimedia.org/r/935422 [10:35:19] (03CR) 10Lucas Werkmeister (WMDE): "Can be verified on mwdebug in a maintenance shell:" [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937121 (https://phabricator.wikimedia.org/T341566) (owner: 10Lucas Werkmeister (WMDE)) [10:36:21] (03PS3) 10Hashar: labs_lvm: pass shellcheck on scripts [puppet] - 10https://gerrit.wikimedia.org/r/935423 [10:37:14] (03CR) 10Hashar: "I have done a grep to ensure all usage of `sopt` got converted to bash arrays:" [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [10:37:33] (03PS5) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) [10:37:35] (03CR) 10Hashar: labs_lvm: wipe fs signatures when creating logical volume (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) (owner: 10Hashar) [10:38:04] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): slow command processing when introducing lots of new hosts - https://phabricator.wikimedia.org/T341674 (10jbond) [10:38:17] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): slow command processing when introducing lots of new hosts - https://phabricator.wikimedia.org/T341674 (10jbond) p:05Triage→03Medium [10:39:12] (03Merged) 10jenkins-bot: Externallinks: Keep domain wildcard if path is not specified [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937108 (https://phabricator.wikimedia.org/T326251) (owner: 10Ladsgroup) [10:39:21] (03PS6) 10Gmodena: data-engineering: add alerts flink enrichment apps [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) [10:39:56] 
10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): slow command processing when introducing lots of new hosts - https://phabricator.wikimedia.org/T341674 (10jbond) [10:40:17] (03CR) 10Gmodena: data-engineering: add alerts flink enrichment apps (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [10:40:56] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:937108|Externallinks: Keep domain wildcard if path is not specified (T326251)]] [10:41:00] T326251: Write code for read new fields of externallinks - https://phabricator.wikimedia.org/T326251 [10:41:20] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): slow command processing when introducing lots of new hosts - https://phabricator.wikimedia.org/T341674 (10jbond) [10:41:24] (03CR) 10Ladsgroup: [C: 03+2] fix: add request headers properly [extensions/ORES] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937122 (https://phabricator.wikimedia.org/T319170) (owner: 10Ladsgroup) [10:42:33] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:937108|Externallinks: Keep domain wildcard if path is not specified (T326251)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [10:43:21] (03Merged) 10jenkins-bot: fix: add request headers properly [extensions/ORES] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937122 (https://phabricator.wikimedia.org/T319170) (owner: 10Ladsgroup) [10:46:09] (03PS2) 10Effie Mouzeli: thumbor: helmfile changes for mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937444 (https://phabricator.wikimedia.org/T318695) [10:49:05] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:937108|Externallinks: Keep domain wildcard if path is not specified (T326251)]] (duration: 08m 09s) [10:49:09] T326251: Write code for read new fields of externallinks - 
https://phabricator.wikimedia.org/T326251 [10:50:01] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:937122|fix: add request headers properly (T319170)]] [10:50:04] T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 [10:51:34] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:937122|fix: add request headers properly (T319170)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [10:51:55] (03PS9) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [11:00:21] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:937122|fix: add request headers properly (T319170)]] (duration: 10m 20s) [11:00:26] T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 [11:06:09] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [11:13:38] (03PS10) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [11:15:01] PROBLEM - PHP opcache health on mw1466 is CRITICAL: CRITICAL: opcache full on php 7.4. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:16:04] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42422/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [11:20:08] (03PS1) 10Daimona Eaytoy: Add new campaign_events.event_answers_status column [extensions/CampaignEvents] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937123 (https://phabricator.wikimedia.org/T341142) [11:21:45] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42423/console" [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [11:26:56] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet [11:29:27] (03CR) 10Hnowlan: cache: set api.wikimedia.org to normal caching (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [11:29:46] (03CR) 10Hnowlan: requirements: bump pyssim [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan) [11:29:48] (03CR) 10Hnowlan: [C: 03+2] requirements: bump pyssim [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan) [11:30:52] (03Abandoned) 10Hnowlan: cache: set api-gateway to normal [puppet] - 10https://gerrit.wikimedia.org/r/862276 (owner: 10Hnowlan) [11:32:19] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] fluent-bit: install wmf-certificates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883605 (owner: 10Hnowlan) [11:32:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/935376 (owner: 10Slyngshede) [11:33:49] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host htmldumper1001.eqiad.wmnet [11:38:17] (03Merged) 10jenkins-bot: requirements: bump pyssim [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan) [11:39:28] (03CR) 10Muehlenhoff: "Few typos inline, looks good otherwise" [software/bitu] - 10https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) (owner: 10Slyngshede) [11:41:54] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) I have moved all primary and secondary instances off the node and temporarily removed it from the ganeti cluster, feel free to power it down/reboot as needed for troubleshooting. [11:43:19] (03CR) 10Majavah: Credit logo artist. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) (owner: 10Slyngshede) [11:43:21] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:43:38] !log rebuilding fluent-bit image to include wmf-certificates [11:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:11] !log rebalance ganeti codfw/C following reboots [11:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:03] (03CR) 10JMeybohm: cfssl::cert: Add support for notifying multiple services (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [11:47:41] (03PS2) 10JMeybohm: cfssl::cert: Add support for notifying multiple services [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) [11:47:43] (03PS3) 10JMeybohm: kubernetes::master: admin.conf on control-plane should use the local api [puppet] - 
10https://gerrit.wikimedia.org/r/937434 (https://phabricator.wikimedia.org/T329826) [11:48:26] !log installing wireshark security updates [11:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:31] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [11:50:15] !log installing apache2 security updates on Bullseye [11:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:52] (03PS4) 10JMeybohm: kubernetes::master: admin.conf on control-plane should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/937434 (https://phabricator.wikimedia.org/T329826) [11:50:54] (03PS5) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [11:51:07] (03PS3) 10Hnowlan: kask: make TLS configuration a secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/849117 [11:51:52] (03CR) 10CI reject: [V: 04-1] kask: make TLS configuration a secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/849117 (owner: 10Hnowlan) [11:55:25] RECOVERY - PHP opcache health on mw1458 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:56:12] (03PS1) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [12:00:05] (03PS4) 10Hnowlan: kask: make TLS configuration a secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/849117 [12:02:28] (03PS3) 10JMeybohm: cfssl::cert: Add support for notifying multiple services [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) 
[12:02:30] (03PS5) 10JMeybohm: kubernetes::master: admin.conf on control-plane should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/937434 (https://phabricator.wikimedia.org/T329826) [12:02:32] (03PS6) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [12:04:27] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [12:04:47] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/935422 (owner: 10Hashar) [12:04:58] (03CR) 10David Caro: [C: 03+1] "Thanks! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) (owner: 10Hashar) [12:07:10] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 9 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42426/console" [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:08:09] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42427/console" [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:11:39] !log mvernon@cumin2002 conftool action : set/pooled=no; selector: name=thanos-fe1002.eqiad.wmnet,service=thanos-web [12:13:43] (03PS7) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [12:15:18] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42428/console" [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:17:07] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 
point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [12:17:24] (03PS1) 10Ilias Sarantopoulos: ores: use envoy proxy for Lift Wing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937453 (https://phabricator.wikimedia.org/T319170) [12:24:09] (03PS1) 10Majavah: P:toolforge: docker: enable --delete for the registry rsync [puppet] - 10https://gerrit.wikimedia.org/r/937454 [12:25:25] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42429/console" [puppet] - 10https://gerrit.wikimedia.org/r/937454 (owner: 10Majavah) [12:30:21] !log upgrade wikidiff2 1.13.0-1+wmf1+buster1 -> 1.14.1-0+wmf1+buster1 on mw-canary hosts T340087 [12:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:25] T340087: Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 [12:31:48] (03CR) 10Jelto: [V: 03+1] "looks mostly good thanks!, one comment in line." [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [12:43:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Nicely done" [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [12:44:02] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[12:51:41] RECOVERY - PHP opcache health on mw1446 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:52:26] !log imported wikidiff2 1.14.1-0+wmf1+buster1+icu67u1 to component/icu67 T340087 T329491 [12:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:31] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [12:52:32] T340087: Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 [12:54:12] !log rebalance ganeti codfw/D following reboots [12:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T1300). [13:00:04] Lucas_WMDE and Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:18] o/ [13:03:37] (03PS1) 10Andrew Bogott: Revert "cinder backups: increase chunked backup file size" [puppet] - 10https://gerrit.wikimedia.org/r/937461 [13:03:39] o/ [13:05:11] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:17] I’ll start with my own (how rude) [13:05:17] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937121 (https://phabricator.wikimedia.org/T341566) (owner: 10Lucas Werkmeister (WMDE)) [13:05:51] FIFO ;) [13:06:06] Daimona: doesn’t your change need a corresponding schema change as well? [13:06:18] * Lucas_WMDE looks at deployment-prep [13:06:24] Yeah, I already have a task for it [13:06:39] https://phabricator.wikimedia.org/T341679 [13:06:41] !log installing node-tough-cookie security updates [13:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:56] But the change itself is a noop, it's just for consistency [13:06:59] I mean in the code, not in production [13:07:08] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Upgrade new codfw switches to Juniper recommended - https://phabricator.wikimedia.org/T341670 (10Papaul) @ayounsi we can still factory reset them and do ZTP again. [13:07:36] so that Beta *wouldn’t* need the kind of manual maintenance you’re apparently planning in https://phabricator.wikimedia.org/T341642 [13:08:03] (or already did?) [13:08:50] Ah, well... 
Since the tables are in x1, the updater wouldn't apply the schema changes anyway, which is why we're doing it all manually [13:09:08] oh, that again [13:09:09] Also the fact that the extension is still marked as experimental, so we're cheating a bit and we're not making backwards-compatible schema changes yet [13:09:18] (03PS1) 10Alexandros Kosiaris: maps: Allow usage by vikidia.org [puppet] - 10https://gerrit.wikimedia.org/r/937463 (https://phabricator.wikimedia.org/T339102) [13:09:20] still sounds like a pain for any developer using the extension though [13:09:41] I don’t generally check “were there any changes to the table definitions” each time I do a git pull, I expect `update.php` to do the needful [13:10:03] I would really like to mark it as stable after the upcoming release, meaning schema changes would be backwards-compatible. But unfortunately I don't have a fix for update.php not running for non-local databases. [13:10:37] Well, update.php does work if you configure the extension to use the local DB [13:10:40] 10SRE, 10serviceops-radar: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10ArielGlenn) A note that I did a test run of sql/xml dumps on deployment-prep with the new icu version and it looks fine to me, though I didn't check for any weird details of category sorting or whatever. [13:11:10] I think there's a task for it, let me see if I can find it [13:12:28] (03CR) 10Andrew Bogott: [C: 03+2] Revert "cinder backups: increase chunked backup file size" [puppet] - 10https://gerrit.wikimedia.org/r/937461 (owner: 10Andrew Bogott) [13:12:40] (Hmmm, I can't) [13:13:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/935462 (owner: 10Slyngshede) [13:14:05] But I'm sure there's one [13:15:00] I don’t see how update.php could work with the change I’m looking at [13:15:28] Ah, no, sorry. 
I meant that it would create the tables when you run it for the first time [13:15:37] well, yeah [13:15:45] but if you did that before this change, then you’re stuck with the old table definition… [13:16:03] Yes, that's right, and what I was talking about when I mentioned "cheating" [13:16:14] But that's something we're really planning to get resolved, possibly after this very schema change [13:16:19] (03PS5) 10Arturo Borrero Gonzalez: hieradata: add cache_hosts for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/831044 (owner: 10Majavah) [13:16:46] then I guess we’re on the same page, I just don’t like the “cheating” [13:17:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: add cache_hosts for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/831044 (owner: 10Majavah) [13:18:06] (03PS5) 10Arturo Borrero Gonzalez: P:mariadb::cloudinfra: add web proxy database/grants [puppet] - 10https://gerrit.wikimedia.org/r/831045 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [13:19:21] Yeah, I don't like it either, but it was just a temporary (TM) thing while we were still figuring out the schema [13:19:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:35] For reference, here's the task https://phabricator.wikimedia.org/T341392 [13:19:52] aha, an opportunity to link one of my favorite quips [13:19:55] https://bash.toolforge.org/quip/AU7VTzhg6snAnmqnK_pc [13:19:58] :P [13:20:44] That's one of my favourites too :D [13:20:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:mariadb::cloudinfra: add web proxy database/grants [puppet] - 10https://gerrit.wikimedia.org/r/831045 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [13:21:12] (03Merged) 10jenkins-bot: Html: Support more attr types in getTextInputAttributes() [core] (wmf/1.41.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/937121 (https://phabricator.wikimedia.org/T341566) (owner: 10Lucas Werkmeister (WMDE)) [13:21:40] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:937121|Html: Support more attr types in getTextInputAttributes() (T341566)]] [13:21:43] T341566: With $wgUseMediaWikiUIEverywhere = true, Xml::input() with class attribute causes warning or TypeError: htmlspecialchars() expects parameter 1 to be string, array given - https://phabricator.wikimedia.org/T341566 [13:21:54] (03PS4) 10Arturo Borrero Gonzalez: dynamicproxy: remove proxygetter [puppet] - 10https://gerrit.wikimedia.org/r/928457 (owner: 10Majavah) [13:22:26] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341648 (10Jhancock.wm) a:03Jhancock.wm [13:22:34] (03CR) 10Majavah: [C: 04-1] "this needs coordination for deploying" [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [13:22:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Lack of an update.php-compatible schema change is justified by T341392. +2ing already to shorten the `scap backport` later." 
[extensions/CampaignEvents] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937123 (https://phabricator.wikimedia.org/T341142) (owner: 10Daimona Eaytoy) [13:22:38] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [13:23:20] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:937121|Html: Support more attr types in getTextInputAttributes() (T341566)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:23:25] testing [13:23:28] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341503 (10Jclark-ctr) a:03Jclark-ctr [13:23:38] looks good, syncing [13:23:47] (tested in shell.php as described on the change) [13:24:09] (03CR) 10Alexandros Kosiaris: changeprop: Change normal_rule_processing to histogram (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris) [13:24:48] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 02m 09s) [13:24:49] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [13:25:15] (03Merged) 10jenkins-bot: Add new campaign_events.event_answers_status column [extensions/CampaignEvents] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937123 (https://phabricator.wikimedia.org/T341142) (owner: 10Daimona Eaytoy) [13:27:20] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 02m 30s) [13:27:37] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [13:28:12] oops, CE CI finished quicker than I expected [13:28:25] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 47s) [13:28:37] RECOVERY - Query Service HTTP Port on wdqs2013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:29:17] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2013 is OK: TCP OK - 0.000 
second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:29:20] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:937121|Html: Support more attr types in getTextInputAttributes() (T341566)]] (duration: 07m 40s) [13:29:21] RECOVERY - WDQS SPARQL on wdqs2013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 0.323 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:29:23] T341566: With $wgUseMediaWikiUIEverywhere = true, Xml::input() with class attribute causes warning or TypeError: htmlspecialchars() expects parameter 1 to be string, array given - https://phabricator.wikimedia.org/T341566 [13:29:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:30:00] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:937123|Add new campaign_events.event_answers_status column (T341142)]] [13:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:05] T341142: Add column to the DB to store whether participant answers have been aggregated for an event - https://phabricator.wikimedia.org/T341142 [13:30:55] Yeah, it's pretty fast :) [13:31:30] !log lucaswerkmeister-wmde@deploy1002 daimona and lucaswerkmeister-wmde: Backport for [[gerrit:937123|Add new campaign_events.event_answers_status column (T341142)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:31:40] nothing to test right? 
[13:31:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [13:31:56] Yup [13:32:36] re fast CI: https://www.youtube.com/watch?v=Ep-7kLhorjw&t=236s [13:33:07] (03PS11) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [13:33:52] (putting on my volunteer hat for a second – https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OAuth/+/937120 could also use some review and/or backporting, I think) [13:33:55] (if anyone feels so inclined ^^) [13:34:13] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:34:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:36:31] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42430/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [13:36:41] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T341652 (10Aklapper) 05Stalled→03Declined > Wikimedia Affiliate supporting project: No support from Wikimedia Affiliate project Unfortunately in that case I'm going to decline this request. Please see https://... [13:37:11] (03CR) 10Hashar: "TLDR: we can use git to find CRLF files and fix them all (git add --renormalize). 
This way we leverage code from upstream instead of custo" [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) (owner: 10Jbond) [13:37:59] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:937123|Add new campaign_events.event_answers_status column (T341142)]] (duration: 07m 59s) [13:38:02] T341142: Add column to the DB to store whether participant answers have been aggregated for an event - https://phabricator.wikimedia.org/T341142 [13:39:18] anything else to deploy? [13:40:17] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:40:40] !log UTC afternoon backport+config window done [13:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:08] (unusually, it was really only a backport window with no config changes. the other way around is more common ^^) [13:41:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Papaul) @Jhancock.wm do we have any update on this? 
[13:41:26] (03PS3) 10Effie Mouzeli: thumbor: helmfile changes for mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937444 (https://phabricator.wikimedia.org/T318695) [13:41:45] (03PS1) 10Muehlenhoff: Rename Ferm::Hosts type to Wmflib::Firewall::Hosts and move to wmflib [puppet] - 10https://gerrit.wikimedia.org/r/937488 (https://phabricator.wikimedia.org/T336497) [13:41:49] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [13:42:00] actually, I can just deploy a security fix real quick [13:42:10] (unrelated, but, might as well) [13:42:11] !log reprepro -C main include bullseye-wikimedia pdns-recursor_4.8.4-1+wmf11u1_amd64.changes: T341611 [13:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:14] T341611: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 [13:42:34] (03Merged) 10jenkins-bot: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [13:43:23] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [13:43:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Papaul) @Jclark-ctr there is another server connected to port 24 on the switch ` papaul@fasw-c-eqiad# run show interfaces ge-[0-1]/0/24 descriptions Interface Admin... [13:44:04] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [13:44:14] Lucas_WMDE: Thanks for the deployment! 
(Sorry, I'm in a meeting and trying to multitask, which I'm very bad at :P) [13:45:06] (03CR) 10Effie Mouzeli: thumbor: helmfile changes for mcrouter support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937444 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [13:45:36] (03CR) 10Elukey: [C: 03+1] ores: use envoy proxy for Lift Wing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937453 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [13:45:38] (03CR) 10Vgutierrez: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [13:46:02] ^^ [13:46:25] Daimona: you’ll have to wait until later to watch the two-second video I sent, then :P [13:46:44] !log doh6001: upgrade to pdns-rec 4.8.4: T341611 [13:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:16] Oh, I didn't even notice :P [13:47:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/937488 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:48:40] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341503 (10Jclark-ctr) 05Open→03Resolved Relocated to new ports on mgmt Link returned [13:49:55] (03CR) 10Fabfur: [V: 03+1] hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [13:51:17] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) @MoritzMuehlenhoff I found a DIMM that is compatible. It has been replaced. 
[13:51:43] (03PS1) 10Muehlenhoff: Move Ferm::Protocol to wmflib (as generic Wmflib::Protocol) [puppet] - 10https://gerrit.wikimedia.org/r/937489 (https://phabricator.wikimedia.org/T336497) [13:51:51] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (10Jclark-ctr) [13:52:07] (03CR) 10CI reject: [V: 04-1] Move Ferm::Protocol to wmflib (as generic Wmflib::Protocol) [puppet] - 10https://gerrit.wikimedia.org/r/937489 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:52:11] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1059.eqiad.wmnet - https://phabricator.wikimedia.org/T338408 (10Jclark-ctr) [13:52:38] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1060.eqiad.wmnet - https://phabricator.wikimedia.org/T338409 (10Jclark-ctr) [13:52:54] I’m not deploying the security fix after all btw, needs more review [13:53:00] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (10Jclark-ctr) [13:53:04] so anyone else is free to go as far as I’m concerned [13:53:11] (03CR) 10Hnowlan: [C: 03+1] thumbor: helmfile changes for mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937444 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [13:53:18] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1062.eqiad.wmnet - https://phabricator.wikimedia.org/T339200 (10Jclark-ctr) [13:53:38] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10Jclark-ctr) [13:53:58] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 (10Jclark-ctr) [13:54:10] 10ops-eqiad, 10Data-Platform-SRE, 
10decommission-hardware: decommission analytics1065.eqiad.wmnet - https://phabricator.wikimedia.org/T341205 (10Jclark-ctr) [13:54:25] PROBLEM - Check systemd state on cp1078 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:30] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (10Jclark-ctr) [13:54:52] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (10Jclark-ctr) [13:55:11] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 (10Jclark-ctr) [13:55:28] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (10Jclark-ctr) [13:55:34] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: helmfile changes for mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937444 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [13:55:53] RECOVERY - Check systemd state on cp1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:15] (03Merged) 10jenkins-bot: thumbor: helmfile changes for mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937444 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [13:56:44] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [13:56:48] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [13:57:32] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [13:58:34] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) 
Thanks! [13:58:43] (03PS2) 10Muehlenhoff: Move Ferm::Protocol to wmflib (as generic Wmflib::Protocol) [puppet] - 10https://gerrit.wikimedia.org/r/937489 (https://phabricator.wikimedia.org/T336497) [13:59:14] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:02:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jclark-ctr) @Papaul frdata1001 has been decom frav1003 is connected to port 24 [14:02:47] (03CR) 10Jbond: "lgtm just a few minor things thanks" [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [14:04:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: remove proxygetter [puppet] - 10https://gerrit.wikimedia.org/r/928457 (owner: 10Majavah) [14:04:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/937489 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:05:02] (03PS4) 10Arturo Borrero Gonzalez: dynamicproxy: move api files to api/ folder [puppet] - 10https://gerrit.wikimedia.org/r/928458 (owner: 10Majavah) [14:05:30] (03PS4) 10Arturo Borrero Gonzalez: mariadb::config::client: allow configuring default database [puppet] - 10https://gerrit.wikimedia.org/r/928461 (owner: 10Majavah) [14:05:36] (03PS5) 10Arturo Borrero Gonzalez: dynamicproxy: use a mariadb backend [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [14:05:41] (03PS7) 10Arturo Borrero Gonzalez: P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [14:05:51] (03CR) 10Muehlenhoff: "I've spun" [puppet] - 10https://gerrit.wikimedia.org/r/937135 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:05:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: move api 
files to api/ folder [puppet] - 10https://gerrit.wikimedia.org/r/928458 (owner: 10Majavah) [14:06:06] (03Abandoned) 10Muehlenhoff: Move nftables/ferm types to wmflib [puppet] - 10https://gerrit.wikimedia.org/r/937135 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:07:56] !log dns4003: upgrade to pdns-rec 4.8.4: T341611 [14:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:00] T341611: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 [14:08:22] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] mariadb::config::client: allow configuring default database (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928461 (owner: 10Majavah) [14:11:23] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [14:11:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Papaul) @Jclark-ctr thanks, will set up the new server to use that port [14:14:07] (03CR) 10Jbond: [C: 04-1] "<3 for adding spec tests, some minor issues inline otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [14:14:08] 10SRE, 10Bitu, 10Infrastructure-Foundations: Create an IDM for Wikimedia developer accounts - https://phabricator.wikimedia.org/T319405 (10Aklapper) [14:17:52] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
[14:17:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/937488 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:18:21] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/937489 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:18:39] (03CR) 10Btullis: [C: 03+1] "Looks great. Do you need a +2 ?" [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [14:20:21] (03CR) 10Gmodena: [C: 03+2] data-engineering: add alerts flink enrichment apps [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [14:21:26] (03Merged) 10jenkins-bot: data-engineering: add alerts flink enrichment apps [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [14:22:18] (03CR) 10Vgutierrez: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [14:24:36] (03PS1) 10Gergő Tisza: Temporarily allow OAuth on non-API entry points again [extensions/OAuth] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937471 (https://phabricator.wikimedia.org/T341656) [14:30:18] Lucas_WMDE: sorry about the OAuth bug. Can you backport it? I have meetings coming up. 
[14:30:39] sure [14:31:00] * Lucas_WMDE justifies this as work-self deploying a change requested by someone else, unrelated to the fact that volunteer-self wrote the change :D [14:31:07] jouncebot: now [14:31:07] No deployments scheduled for the next 2 hour(s) and 28 minute(s) [14:31:47] hm, let me see how I’ll be able to test this on mwdebug [14:32:27] I’ll just register an OAuth2 owner-only consumer again [14:32:50] should only affect OAuth1 I think? [14:33:03] OAuth 2 uses an API endpoint for identify [14:33:14] (though certainly no harm in testing it) [14:33:31] yeah, but I can still tell the difference between a generic error page and a “this isn’t an OAuth 1 token you dummy” error JSON response ^^ [14:33:38] although, I probably don’t need a real token for that? [14:33:39] you can register a proper consumer, it will work with your own user account without approval [14:33:40] let me check [14:33:52] yeah but then I need to slap together some code to test it [14:34:20] there is an app for that, give me a sec [14:34:42] yeah Authorization: Bearer 0000 is enough, I don’t need a real client [14:34:58] (expected result is then {"error":"mwoauth-oauth-exception","message":"An error occurred in the OAuth protocol: Invalid consumer key"} instead of the generic mw error page) [14:35:01] (03PS12) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [14:35:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/OAuth] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937471 (https://phabricator.wikimedia.org/T341656) (owner: 10Gergő Tisza) [14:36:37] Lucas_WMDE: https://oauth-hello-world.toolforge.org/ [14:37:59] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42431/console" [puppet] -
10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [14:38:35] (03CR) 10Fabfur: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [14:38:53] tgr_: thanks, but I think I’ll just test with a fake bearer token and that should be enough for mwdebug [14:39:07] and once it’s rolled out everywhere I can check that e.g. phab login via mw.o is working again [14:40:42] (03Merged) 10jenkins-bot: Temporarily allow OAuth on non-API entry points again [extensions/OAuth] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937471 (https://phabricator.wikimedia.org/T341656) (owner: 10Gergő Tisza) [14:41:12] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:937471|Temporarily allow OAuth on non-API entry points again (T341656)]] [14:41:15] T341656: Special:OAuth/identify broken (affects pagepile+massviews tools, phab mw.o login, …) - https://phabricator.wikimedia.org/T341656 [14:42:44] !log lucaswerkmeister-wmde@deploy1002 tgr and lucaswerkmeister-wmde: Backport for [[gerrit:937471|Temporarily allow OAuth on non-API entry points again (T341656)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:43:27] looks good, `curl -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' -H 'Authorization: Bearer 0000' https://www.mediawiki.org/wiki/Special:OAuth/identify` returns JSON again [14:43:36] syncing [14:48:04] (03PS1) 10Effie Mouzeli: thumbor: mcrouter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937492 [14:48:28] !log upgrade dns2004 to gdnsd 3.99.0~alpha2 [14:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:37] (03CR) 10CI reject: [V: 04-1] thumbor: mcrouter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937492 (owner: 10Effie 
Mouzeli) [14:49:15] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:937471|Temporarily allow OAuth on non-API entry points again (T341656)]] (duration: 08m 03s) [14:49:18] T341656: Special:OAuth/identify broken (affects pagepile+massviews tools, phab mw.o login, …) - https://phabricator.wikimedia.org/T341656 [14:50:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:50:43] (03PS2) 10Effie Mouzeli: thumbor: mcrouter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937492 [14:52:48] tgr_: all done I think, enjoy your meeting [14:52:55] * Lucas_WMDE done deploying [14:55:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:55:47] (03CR) 10Kamila Součková: "Thank you! Pushing updated version. I will rename it to just "benthos" (and re-think config file) separately." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (owner: 10Kamila Součková) [14:56:06] (03PS6) 10Kamila Součková: [WIP] add Benthos cache invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 [14:56:48] (03PS13) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [14:58:23] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42432/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [14:58:37] (03CR) 10Hnowlan: [C: 03+1] thumbor: mcrouter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937492 (owner: 10Effie Mouzeli) [15:00:14] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: mcrouter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937492 (owner: 10Effie Mouzeli) [15:01:04] (03Merged) 10jenkins-bot: thumbor: mcrouter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/937492 (owner: 10Effie Mouzeli) [15:03:10] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:05:03] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
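The mwdebug check discussed between 14:33 and 14:43 above (send a fake bearer token to Special:OAuth/identify and see whether the reply is the OAuth JSON error or a generic error page) can be sketched as a small classifier. This is only an illustration, not what was actually run in the channel: the function name `classify_identify_response` and its three return labels are hypothetical, and the only values taken from the log are the `mwoauth-oauth-exception` error code and the sample JSON body quoted at 14:34:58.

```python
import json

# Error code quoted in the 14:34:58 message; everything else here is assumed.
EXPECTED_ERROR = "mwoauth-oauth-exception"


def classify_identify_response(body: str) -> str:
    """Classify a Special:OAuth/identify response body (hypothetical helper).

    Returns "oauth-json-error" when the body is the JSON error object that
    the fake-token test expects, "other-json" for any other valid JSON, and
    "not-json" for anything else (for example an HTML error page).
    """
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "not-json"
    if isinstance(payload, dict) and payload.get("error") == EXPECTED_ERROR:
        return "oauth-json-error"
    return "other-json"


if __name__ == "__main__":
    # Sample body copied from the expected result quoted in the log.
    sample = (
        '{"error":"mwoauth-oauth-exception",'
        '"message":"An error occurred in the OAuth protocol: Invalid consumer key"}'
    )
    print(classify_identify_response(sample))
```

The body to classify would come from a request like the `curl` command quoted at 14:43:27; fetching it is deliberately left out so the sketch stays independent of network access.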
[15:07:14] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:07:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bookworm [15:08:14] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:08:21] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:08:29] BGP alerts in drmrs expected [15:08:47] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:08:55] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2014 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:08:59] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2014 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:08:59] PROBLEM - Query Service HTTP Port on wdqs2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:09:15] PROBLEM - WDQS SPARQL on wdqs2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:09:25] PROBLEM - Check systemd state on wdqs2014 is CRITICAL: CRITICAL - degraded: The following units failed: load-dcatap-weekly.service,prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:23] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:11:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons.
[15:11:59] BFD status expected ^ [15:12:05] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:12:08] ^ this too [15:14:28] 10SRE, 10Data-Platform-SRE, 10Discovery-Search: eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10bking) [15:16:24] 10SRE, 10Data-Platform-SRE, 10Discovery-Search, 10vm-requests: eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10jbond) [15:17:26] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM for MySQL Orchestrator - https://phabricator.wikimedia.org/T332718 (10jbond) p:05Triage→03Medium [15:17:43] 10SRE, 10Data-Platform-SRE, 10Discovery-Search, 10vm-requests: eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10jbond) approved [15:17:59] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM for MySQL Orchestrator - https://phabricator.wikimedia.org/T332718 (10jbond) approved [15:18:26] 10SRE, 10Data-Platform-SRE, 10Discovery-Search, 10vm-requests: eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10bking) [15:24:13] (03PS1) 10Majavah: dynamicproxy: api: fix file location [puppet] - 10https://gerrit.wikimedia.org/r/937498 [15:24:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: api: fix file location [puppet] - 10https://gerrit.wikimedia.org/r/937498 (owner: 10Majavah) [15:25:16] (03PS2) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [15:25:40] (03PS1) 10Volans: irc: small refactor to cleanup the code [software/pywmflib] - 10https://gerrit.wikimedia.org/r/937499 [15:25:46] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): slow command processing when introducing lots of new hosts - 
https://phabricator.wikimedia.org/T341674 (10jbond) >>! In T341674#9009233, @jhathaway wrote: > @jbond great writeup, do you think there is enough information here to log a bug report with... [15:28:55] (03PS6) 10Majavah: dynamicproxy: use a mariadb backend [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) [15:28:57] (03PS8) 10Majavah: P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) [15:29:05] (03CR) 10Majavah: [C: 04-1] dynamicproxy: use a mariadb backend [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [15:33:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:01] (03PS1) 10Effie Mouzeli: thumbor: fix mcrouter prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/937500 [15:34:34] !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [15:34:47] (03CR) 10Hnowlan: [C: 03+1] thumbor: fix mcrouter prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/937500 (owner: 10Effie Mouzeli) [15:35:04] !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:41:58] (03CR) 10Volans: "In general LGTM, I think we could abstract and simplify a bit the shell part of junos but as a starter can be ok. 
See more detailed commen" [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [15:42:21] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: fix mcrouter prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/937500 (owner: 10Effie Mouzeli) [15:42:51] !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [15:42:59] !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:43:05] (03Merged) 10jenkins-bot: thumbor: fix mcrouter prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/937500 (owner: 10Effie Mouzeli) [15:43:26] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:43:46] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:45:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:31] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [15:54:03] (03CR) 10RLazarus: [C: 03+1] service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [15:56:25] RECOVERY - PHP opcache health on parse1015 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:57:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [15:58:23] (ProbeDown) firing: (3) Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:21] (ProbeDown) resolved: (3) Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:49] (03CR) 10Eevans: [C: 03+1] "Insofar as I understand this; SGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/849117 (owner: 10Hnowlan) [16:01:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [16:03:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:08:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:08:53] (03PS3) 10Milimetric: replicas: redact revdeleted, oversighted information [puppet] - 10https://gerrit.wikimedia.org/r/935752 (https://phabricator.wikimedia.org/T339037) (owner: 10Samuel (WMF)) [16:17:38] (03PS1) 10Krinkle: rsyslog: ingest 'excimer' logs from webperf to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/937504 
(https://phabricator.wikimedia.org/T339137) [16:19:39] (03CR) 10BCornwall: "This change is ready for review." (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [16:20:35] (03CR) 10Krinkle: "I don't know whether this is the right place and way for this. Just copying what I see for navtiming, which seems to only be mentioned in " [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [16:20:49] (03CR) 10Ssingh: "Will leave the review to jbond and volans but thanks for working on this!" [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [16:20:59] (03CR) 10CI reject: [V: 04-1] roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [16:21:22] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host durum6001.drmrs.wmnet with OS bookworm [16:21:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bullseye [16:23:14] (03CR) 10Btullis: [C: 03+2] "Patchset 3 looks good in our testing. Merging and deploying now." 
[puppet] - 10https://gerrit.wikimedia.org/r/935752 (https://phabricator.wikimedia.org/T339037) (owner: 10Samuel (WMF)) [16:25:06] (03PS3) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [16:25:53] (03CR) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [16:27:05] (03CR) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [16:29:24] (03PS4) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 [16:30:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1013.eqiad.wmnet [16:31:13] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@a0e00cb] (releasing): (no justification provided) [16:31:24] (03PS4) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [16:31:34] (03CR) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [16:31:36] (03CR) 10CI reject: [V: 04-1] roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [16:31:41] (03PS7) 10Kamila Součková: [WIP] add Benthos cache invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 [16:32:12] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@a0e00cb] (releasing): (no justification provided) (duration: 00m 58s) [16:32:17] (03CR) 
10Kamila Součková: "fixed up the things my squirrel brain forgot about" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (owner: 10Kamila Součková) [16:32:57] RECOVERY - PHP opcache health on mw1466 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:34:55] (03PS5) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 [16:35:11] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [16:37:06] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: service=wikireplicas-b,name=dbproxy1018.eqiad.wmnet [16:37:13] (03PS5) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [16:37:46] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: service=wikireplicas-b,name=dbproxy1019.eqiad.wmnet [16:38:43] !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [16:40:07] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.243:3318, 208.80.154.243:3315, 208.80.154.243:3314, 208.80.154.243:3317, 208.80.154.243:3316, 208.80.154.243:3311, 208.80.154.243:3313, 208.80.154.243:3312]) https://wikitech.wikimedia.org/wiki/PyBal [16:40:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [16:40:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:40:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts dbproxy1013.eqiad.wmnet [16:40:43] (03CR) 10Jbond: "thanks for this lgtm couple of minor nits, the main one im curious about is supporting the no services.pp#45" [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [16:41:26] (03CR) 10Jbond: nftables: spec: introduce service tests (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [16:41:29] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: service=wikireplicas-b,name=dbproxy1019.eqiad.wmnet [16:41:49] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: service=wikireplicas-b,name=dbproxy1018.eqiad.wmnet [16:42:31] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [16:42:45] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [16:44:31] PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.242:3316, 208.80.154.242:3317, 208.80.154.242:3314, 208.80.154.242:3315, 208.80.154.242:3312, 208.80.154.243:3318, 208.80.154.242:3311, 208.80.154.243:3315, 208.80.154.243:3314, 208.80.154.243:3317, 208.80.154.243:3316, 208.80.154.243:3311, 208.80.154.243:3313, 208.80.154.243:3312, 208.80.154.242:3318, 208.80.154.242: [16:44:31] https://wikitech.wikimedia.org/wiki/PyBal [16:44:55] hmmm [16:45:53] btullis: ^ [16:47:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [16:48:09] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1013.eqiad.wmnet - https://phabricator.wikimedia.org/T341711 (10Ladsgroup) [16:48:44] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission 
dbproxy1013.eqiad.wmnet - https://phabricator.wikimedia.org/T341711 (10Ladsgroup) a:05Ladsgroup→03wiki_willy [16:49:18] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1013.eqiad.wmnet - https://phabricator.wikimedia.org/T341711 (10Ladsgroup) [16:51:21] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [16:51:55] (03PS6) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [16:53:55] (03PS7) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [16:54:31] (03CR) 10CI reject: [V: 04-1] nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [16:59:44] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [16:59:57] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1013.eqiad.wmnet - https://phabricator.wikimedia.org/T341711 (10wiki_willy) a:05wiki_willy→03Jclark-ctr [16:59:57] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T1700) [17:00:07] (03PS6) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 [17:02:25] (03CR) 10CI reject: [V: 04-1] roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [17:02:47] (03PS1) 10Volans: sre.hosts.decommission: fix call to downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/937508 
[17:05:28] (03PS7) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 [17:05:59] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:07:21] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:07:44] (03CR) 10CI reject: [V: 04-1] roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [17:07:55] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6001.drmrs.wmnet with OS bullseye [17:09:10] So we've got another wikireplicas outage caused by my running conftool to pool/depool the dbproxies. I'm not sure what to do with the `Services known to PyBal but not to IPVS` error. Do I restart pybal? [17:10:17] This is the doc for what happened the last time, but I didn't get to the root cause then: https://docs.google.com/document/d/1yo0pCpOSQ4waAPtWU06xMQUs7UMhM5wnAj2U4NOObj8/edit [17:10:31] Sorry I didn't get as far as writing it up on Wikitech either. 
[17:10:38] !log restart pybal on lvs1018 [17:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:43] RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:11:09] btullis: yeah, a restart, though given that this happened again [17:11:16] we should look into it a bit more carefully [17:11:25] I am about to head to a meeting so ping me if required; I will look at this later [17:11:28] thanks [17:13:24] (03PS1) 10Volans: sre.hosts.decommission: downtime mgmt only in AM [cookbooks] - 10https://gerrit.wikimedia.org/r/937509 [17:15:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Papaul) @Dwisehaupt i update the switch configurations. You should be good now. ` papaul@fasw-c-eqiad# show | compare [edit interfaces interface-range vlan-administration]... [17:16:53] Hi there: I have a backport question: A bug that breaks the en page triage workflows was merged in Monday night, and I have a fix for it here: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/937169/ . I just attached it to the task with the breaking change, so it's already in this release project. I know we have a backport window in a few hours, and I'm happy to put this on the list and be present [17:16:53] for it. Can y'all take a look and see if this qualifies for backporting? Is there something else I should be doing for this? This is my first stop after reading through https://www.mediawiki.org/wiki/Backporting_fixes [17:16:55] sukhe: Thanks so much, I think we're good but I'll write it up and mention that this is a repeat incident. This time I'll try to get some tickets out of it to follow up properly. 
[17:17:04] btullis: yes that's a good idea [17:17:44] in a meeting and happy to talk shortly [17:21:37] (03CR) 10Volans: roll-restart-wikimedia-dns: Add reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [17:22:54] urbanecm, taavi, TheresNoTime, TheresNoTime: see JSherman's question [17:23:17] * taavi looks [17:23:20] (03PS1) 10Ayounsi: Add python3-cryptography to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/937510 (https://phabricator.wikimedia.org/T334594) [17:24:46] * urbanecm doesn't see why it shouldn't qualify for backporting [17:25:03] JSherman: that's a fairly small patch so there are very few possible downsides of backporting it (minus the time it takes of course). there's not really much more information in the commit message or the task for me to work on, so if you feel like it needs a backport then go for it [17:25:28] (03CR) 10CI reject: [V: 04-1] Add python3-cryptography to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/937510 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [17:25:57] (03PS15) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [17:26:07] (03CR) 10Ayounsi: "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [17:26:10] the main things that deployers will ask questions about are translation backports and patches adding larger new features, but others we rarely have a problem with [17:26:22] ^^ [17:26:49] * TheresNoTime concurs [17:27:08] it sounds like not backporting might break Special:PageTriage when on group2/enwiki, so backporting seems like a good idea. 
[17:27:32] and that was my dolt +2 which introduced it, so I'm happy to do the remedial work [17:28:16] TheresNoTime: we can actually blame you for once then :) [17:28:49] as if that normally stops people [17:29:13] TheresNoTime: oh no, I can blame you for anything. It's just true today :) [17:30:10] (03PS2) 10Ayounsi: Add python3-cryptography to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/937510 (https://phabricator.wikimedia.org/T334594) [17:31:40] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10cmooney) >>! In T341494#9004717, @aborrero wrote: >>>! In T341494#9004690, @cmooney wrote: >> Put cloudservices1005 in D5 if there is... [17:31:56] thanks all, I'll put it on the calendar and see you then! [17:32:18] (03CR) 10CI reject: [V: 04-1] Add python3-cryptography to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/937510 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [17:37:43] (03PS1) 10Samtar: [ruwiki] Add permissions to 'editor' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937511 (https://phabricator.wikimedia.org/T341707) [17:37:45] (03CR) 10Cwhite: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [17:38:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:39:55] (03CR) 10Dzahn: "I have no strong opinion, I am removing myself only because I will be out for some time and 
clean up my review queue." [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [17:42:20] (03PS8) 10Dzahn: planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 [17:43:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:48:36] (03CR) 10Ayounsi: Juniper class-of-service config and updated border-in filter for QoS (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/931691 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [17:50:27] (03PS3) 10Ayounsi: Add python3-cryptography to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/937510 (https://phabricator.wikimedia.org/T334594) [17:51:43] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/902513/42434/planet1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/902513 (owner: 10Dzahn) [17:53:33] (03PS1) 10Stang: ruwikibooks: Add NS104 to wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937513 (https://phabricator.wikimedia.org/T341708) [17:56:12] (03CR) 10Dzahn: [C: 03+1] "@hashar @jelto Let me deploy it or can you deploy it tomorrow? Then my gerrit queue would be clean:)" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [18:00:04] dduvall and hashar: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T1800). 
[18:00:05] dduvall and hashar: Dear deployers, time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T1800). [18:01:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dduvall@deploy1002 using scap backport" [extensions/Translate] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937117 (https://phabricator.wikimedia.org/T341627) (owner: 10Abijeet Patro) [18:04:24] (03PS1) 10Dzahn: microsites: add monitoring for statictendril site for retired services [puppet] - 10https://gerrit.wikimedia.org/r/937514 (https://phabricator.wikimedia.org/T340182) [18:06:56] (03CR) 10Dzahn: [C: 03+1] "exactly like other services we already moved" [puppet] - 10https://gerrit.wikimedia.org/r/937514 (https://phabricator.wikimedia.org/T340182) (owner: 10Dzahn) [18:15:56] (03Merged) 10jenkins-bot: QueryMessageGroupActionApi: Apply sorting to groups only [extensions/Translate] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937117 (https://phabricator.wikimedia.org/T341627) (owner: 10Abijeet Patro) [18:16:24] !log dduvall@deploy1002 Started scap: Backport for [[gerrit:937117|QueryMessageGroupActionApi: Apply sorting to groups only (T341627)]] [18:16:28] T341627: TypeError: Argument 1 passed to MediaWiki\Extension\Translate\MessageGroupProcessing\MessageGroups::groupLabelSort() must implement interface MessageGroup, array given - https://phabricator.wikimedia.org/T341627 [18:17:56] !log dduvall@deploy1002 abi and dduvall: Backport for [[gerrit:937117|QueryMessageGroupActionApi: Apply sorting to groups only (T341627)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [18:18:21] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:21:41] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Add config-master to puppetserver role - https://phabricator.wikimedia.org/T341717 (10jbond) [18:23:18] (03PS1) 10Jbond: role::puppetserver: Add config master [puppet] - 10https://gerrit.wikimedia.org/r/937518 [18:23:41] (03CR) 10CI reject: [V: 04-1] role::puppetserver: Add config master [puppet] - 10https://gerrit.wikimedia.org/r/937518 (owner: 10Jbond) [18:23:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42435/console" [puppet] - 10https://gerrit.wikimedia.org/r/937518 (owner: 10Jbond) [18:24:15] (03PS2) 10Jbond: role::puppetserver: Add config master [puppet] - 10https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) [18:24:47] !log dduvall@deploy1002 Finished scap: Backport for [[gerrit:937117|QueryMessageGroupActionApi: Apply sorting to groups only (T341627)]] (duration: 08m 22s) [18:24:51] T341627: TypeError: Argument 1 passed to MediaWiki\Extension\Translate\MessageGroupProcessing\MessageGroups::groupLabelSort() must implement interface MessageGroup, array given - https://phabricator.wikimedia.org/T341627 [18:25:50] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937519 (https://phabricator.wikimedia.org/T340245) [18:25:52] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937519 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot) [18:26:32] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937519 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot) [18:30:19] (03PS3) 10Jbond: role::puppetserver: Add config master 
[puppet] - 10https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) [18:33:11] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.17 refs T340245 [18:33:15] T340245: 1.41.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T340245 [18:39:27] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.17 refs T340245 (duration: 06m 16s) [18:39:31] T340245: 1.41.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T340245 [18:43:58] train is all clear [18:46:18] all aboard [18:46:27] :D [18:55:38] (03PS4) 10Jbond: role::puppetserver: Add config master [puppet] - 10https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) [18:55:40] (03PS1) 10Jbond: httpd: Add a profile for loading httpd [puppet] - 10https://gerrit.wikimedia.org/r/937520 (https://phabricator.wikimedia.org/T341717) [18:57:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42438/console" [puppet] - 10https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [18:57:52] (03CR) 10CI reject: [V: 04-1] httpd: Add a profile for loading httpd [puppet] - 10https://gerrit.wikimedia.org/r/937520 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [19:20:09] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add config-master to puppetserver role - https://phabricator.wikimedia.org/T341717 (10jbond) I wonder if we should instead move config-master to a VM. AFAIK the only reason that it needs to be on the... 
[19:20:57] (03CR) 10Ahmon Dancy: Run LDAP group sync periodically on active gitlab server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [19:25:02] (03PS6) 10Ahmon Dancy: Run LDAP group sync periodically on gitlab replicas [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) [19:25:07] (03CR) 10Ahmon Dancy: Run LDAP group sync periodically on gitlab replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [19:33:51] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Add conftool::master to puppetserver - https://phabricator.wikimedia.org/T341721 (10jbond) [19:34:01] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Add conftool::master to puppetserver - https://phabricator.wikimedia.org/T341721 (10jbond) p:05Triage→03Medium [19:39:07] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Add conftool::master to puppetserver - https://phabricator.wikimedia.org/T341721 (10jbond) Just noting that we also run requestctl from the puppetmasteres. so either way they would at the very least need to be conftool::c... 
[19:40:13] (03PS2) 10Samtar: [ruwiki] Add permissions to 'editor' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937511 (https://phabricator.wikimedia.org/T341707) [19:41:42] (03PS1) 10Samtar: Fix Error: Module "./ext.pageTriage.defaultTagsOptions.js" is not loaded [extensions/PageTriage] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937478 (https://phabricator.wikimedia.org/T340112) [19:44:52] (03PS2) 10Jbond: httpd: Add a profile for loading httpd [puppet] - 10https://gerrit.wikimedia.org/r/937520 (https://phabricator.wikimedia.org/T341717) [19:44:54] (03PS5) 10Jbond: role::puppetserver: Add config master [puppet] - 10https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) [19:44:57] (03PS1) 10Jbond: puppetserver: Add conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/937522 (https://phabricator.wikimedia.org/T341721) [19:45:09] (03PS3) 10Jbond: httpd: Add a profile for loading httpd [puppet] - 10https://gerrit.wikimedia.org/r/937520 (https://phabricator.wikimedia.org/T341717) [19:47:16] (03CR) 10CI reject: [V: 04-1] httpd: Add a profile for loading httpd [puppet] - 10https://gerrit.wikimedia.org/r/937520 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [19:47:18] (03PS4) 10Jbond: httpd: Add a profile for loading httpd [puppet] - 10https://gerrit.wikimedia.org/r/937520 (https://phabricator.wikimedia.org/T341717) [19:47:20] (03PS6) 10Jbond: role::puppetserver: Add config master [puppet] - 10https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) [19:48:31] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [19:49:27] (03CR) 10jenkins-bot: httpd: Add a profile for loading httpd [puppet] - 10https://gerrit.wikimedia.org/r/937520 
(https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [19:57:02] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2013.codfw.wmnet: Applying JVM update - eevans@cumin1001 [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230712T2000). [20:00:05] TheresNoTime, JSherman, and koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] * TheresNoTime can deploy [20:00:20] o/ [20:00:22] starting with mine :) [20:00:24] I'm also here, but go ahead Sammy :) [20:00:31] :-) [20:01:05] urbanecm: want to give https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/937511 a very quick look so I'm not +2ing my own change unchecked? 
[20:01:19] (03CR) 10Urbanecm: [C: 03+1] [ruwiki] Add permissions to 'editor' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937511 (https://phabricator.wikimedia.org/T341707) (owner: 10Samtar) [20:01:24] ta [20:01:26] TheresNoTime: done :) [20:01:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937511 (https://phabricator.wikimedia.org/T341707) (owner: 10Samtar) [20:01:38] * urbanecm also suggests to +2 the backport now to save time on CI, but up2you TheresNoTime :) [20:02:14] (03Merged) 10jenkins-bot: [ruwiki] Add permissions to 'editor' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937511 (https://phabricator.wikimedia.org/T341707) (owner: 10Samtar) [20:02:28] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [extensions/PageTriage] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937478 (https://phabricator.wikimedia.org/T340112) (owner: 10Samtar) [20:02:46] !log samtar@deploy1002 Started scap: Backport for [[gerrit:937511|[ruwiki] Add permissions to 'editor' usergroup (T341707)]] [20:02:52] T341707: Add rights to "editor" usergroup in ruwiki - https://phabricator.wikimedia.org/T341707 [20:02:54] PageTriage only takes like 4 minutes, somehow [20:03:14] that's...surprisingly quick. [20:03:31] *ETA of 4 minutes, so.. ymmv [20:03:38] how long would you have expected? [20:03:58] 15 to 20 seems to be fairly normal afaik [20:04:00] CI on mediawiki code generally runs 15m-30m (hence my suggestion to +2 early) [20:04:18] !log samtar@deploy1002 samtar: Backport for [[gerrit:937511|[ruwiki] Add permissions to 'editor' usergroup (T341707)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:04:21] but it's _good_ it runs quickly. as long as it tests everything needed :) [20:04:28] hmm, does PageTriage not run the full gate? why would that be? 
[20:04:33] * TheresNoTime testing [20:04:45] (CA does that too, but at least it has a good reason for it) [20:05:10] * TheresNoTime syncing [20:05:33] hmm; this is a js-only change. would that make a difference? [20:06:20] taavi: maybe because it's (currently) practically an enwiki-specific extension? but yeah, gate's missing (and that explains the speed). [20:06:20] (03PS1) 10Effie Mouzeli: thumbor: add failure_throttling_memcache variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/937524 (https://phabricator.wikimedia.org/T318695) [20:07:07] that should probably be fixed then! :D [20:07:19] *unless there *is* a good reason [20:07:53] (03Merged) 10jenkins-bot: Fix Error: Module "./ext.pageTriage.defaultTagsOptions.js" is not loaded [extensions/PageTriage] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937478 (https://phabricator.wikimedia.org/T340112) (owner: 10Samtar) [20:08:02] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2013.codfw.wmnet: Applying JVM update - eevans@cumin1001 [20:08:34] (03PS5) 10Jbond: httpd: Add a profile for loading httpd [puppet] - 10https://gerrit.wikimedia.org/r/937520 (https://phabricator.wikimedia.org/T341717) [20:08:36] (03PS7) 10Jbond: role::puppetserver: Add config master [puppet] - 10https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) [20:08:36] JSherman: probably no difference on this patch. future patches merged in PageTriage might break tests in other extensions (that have their tests gated, aka running on +2 on any patch in core or any extension), which might cause a breakage. occasionally, there is a good reason for not running gated tests in an extension (such as in CentralAuth). [20:09:10] generally it's a good idea to have gated tests running for all extensions, unless there is a good reason to not do that. that's something handled at the CI config layer, not something that's wrong in the extension itself. 
[20:10:05] urbanecm: thanks for the context!
[20:10:20] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42440/console" [puppet] - https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[20:10:30] np
[20:10:51] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:937511|[ruwiki] Add permissions to 'editor' usergroup (T341707)]] (duration: 08m 04s)
[20:10:55] T341707: Add rights to "editor" usergroup in ruwiki - https://phabricator.wikimedia.org/T341707
[20:11:14] (PS2) Effie Mouzeli: thumbor: add failure_throttling_memcache variable [deployment-charts] - https://gerrit.wikimedia.org/r/937524 (https://phabricator.wikimedia.org/T318695)
[20:11:29] starting 937478
[20:11:49] !log samtar@deploy1002 Started scap: Backport for [[gerrit:937478|Fix Error: Module "./ext.pageTriage.defaultTagsOptions.js" is not loaded (T340112)]]
[20:11:52] T340112: PageTriage: Add example jest smoke test for curation toolbar - https://phabricator.wikimedia.org/T340112
[20:13:23] !log samtar@deploy1002 samtar: Backport for [[gerrit:937478|Fix Error: Module "./ext.pageTriage.defaultTagsOptions.js" is not loaded (T340112)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:13:25] JSherman: live on mwdebug, can you test this?
[20:13:41] TheresNoTime: testing now
[20:15:19] verified that the toolbar is loading and that all of the non-destructive buttons work
[20:15:21] https://phabricator.wikimedia.org/T333534 - Add PageTriage to gated extensions. Has a patch if anyone wants to review it.
[20:15:34] (PS3) Effie Mouzeli: thumbor: add failure_throttling_memcache variable [deployment-charts] - https://gerrit.wikimedia.org/r/937524 (https://phabricator.wikimedia.org/T318695)
[20:15:36] JSherman: ack, syncing
[20:16:17] SRE, LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (KFrancis) Hi all, the agreement has been sent for signatures. I'll confirm when it's complete.
[20:16:44] NovemLinguae: that's a patch for the other side of things (adding PageTriage to the list of gated extensions). what is missing is that PageTriage doesn't run tests for (other) gated extensions. but thanks for the pointer to the task, i'll comment there.
[20:17:14] (PS4) Effie Mouzeli: thumbor: add failure_throttling_memcache variable [deployment-charts] - https://gerrit.wikimedia.org/r/937524 (https://phabricator.wikimedia.org/T318695)
[20:17:29] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[13,19].codfw.wmnet: Applying JVM update - eevans@cumin1001
[20:17:34] but it adds the jobs to PageTriage too, so probably good enough :)
[20:18:10] we can try deploying it if you want
[20:18:31] (CR) Effie Mouzeli: [C: +2] thumbor: add failure_throttling_memcache variable [deployment-charts] - https://gerrit.wikimedia.org/r/937524 (https://phabricator.wikimedia.org/T318695) (owner: Effie Mouzeli)
[20:18:36] (PS2) Samtar: log additional events on Special:Diff|MobileDiff [mediawiki-config] - https://gerrit.wikimedia.org/r/937096 (https://phabricator.wikimedia.org/T326212) (owner: Jsn.sherman)
[20:19:15] (Merged) jenkins-bot: thumbor: add failure_throttling_memcache variable [deployment-charts] - https://gerrit.wikimedia.org/r/937524 (https://phabricator.wikimedia.org/T318695) (owner: Effie Mouzeli)
[20:19:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:19:50] taavi: i'll leave that decision to you as a nearby CI admin :). Kosta (the patch owner) is currently OoO until mid August, so if you're comfortable deploying that in that situation, let's go for it.
[20:20:23] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[20:20:27] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[20:20:42] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[20:20:49] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[20:21:16] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:937478|Fix Error: Module "./ext.pageTriage.defaultTagsOptions.js" is not loaded (T340112)]] (duration: 09m 27s)
[20:21:19] T340112: PageTriage: Add example jest smoke test for curation toolbar - https://phabricator.wikimedia.org/T340112
[20:21:25] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/937096 (https://phabricator.wikimedia.org/T326212) (owner: Jsn.sherman)
[20:21:35] let's give it a try then. worst case is that we have to revert
[20:22:03] (Merged) jenkins-bot: log additional events on Special:Diff|MobileDiff [mediawiki-config] - https://gerrit.wikimedia.org/r/937096 (https://phabricator.wikimedia.org/T326212) (owner: Jsn.sherman)
[20:22:05] JSherman: live, and if I'm remembering correctly, I had to revert `log additional events on Special:Diff|MobileDiff` last time?
[20:22:27] yeah, this is an updated patch which sets a value for sample rate
[20:22:32] !log samtar@deploy1002 Started scap: Backport for [[gerrit:937096|log additional events on Special:Diff|MobileDiff (T326212)]]
[20:22:34] T326212: Improve data logging on Special:Diff and Special:MobileDiff - https://phabricator.wikimedia.org/T326212
[20:23:15] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[20:23:15] ty taavi
[20:23:44] cming indicated that was why the instrument wasn't posting
[20:24:04] !log samtar@deploy1002 jsn and samtar: Backport for [[gerrit:937096|log additional events on Special:Diff|MobileDiff (T326212)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[20:24:06] JSherman: live on mwdebug, let's hope it works this time ^^
[20:24:11] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[20:24:22] TheresNoTime: testing!
[20:24:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:25:16] testing with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/937525/ and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/937526/
[20:26:29] urbanecm: this does not look good: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-noselenium-docker/113602/console
[20:27:01] indeed. why does it run CA though?
[20:27:12] that's an excellent question I'm trying to answer atm
[20:28:33] Ext:PageTriage seems to depend on Ext:Echo, and Ext:Echo has a dependency on Ext:CentralAuth
[20:29:08] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[13,19].codfw.wmnet: Applying JVM update - eevans@cumin1001
[20:29:13] TheresNoTime: hmm, still not seeing any events posted. I'm just using the browser extension set to mwdebug1001. Am I dong something sill?
[20:29:20] *silly*?
[20:29:31] although extensions/Echo has `name: extension-gate` too
[20:29:34] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[14,21,24].codfw.wmnet: Applying JVM update - eevans@cumin1001
[20:30:02] JSherman: looking.. but I'm not sure how much help I can be with this!
[20:30:08] taavi: and it is itself gated
[20:30:32] JSherman: can you describe what you did to test the events?
[20:30:54] (particularly where you are looking for incoming events)
[20:31:32] urbanecm: ok, it broke core's test suite too. https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php74-docker/19931/console I'm reverting
[20:31:46] 👍 for the revert
[20:32:42] urbanecm: browser extension set as described, I'm going to a diff page eg. https://en.wikipedia.org/w/index.php?title=Liang%27s_Garden&diff=prev&oldid=712303476 and clicking on some of the items that we should be logging, such as previous / next edit. I'm then checking for posts to intage-analytics and curling https://stream.wikimedia.org/v2/stream/mediawiki.special_diff_interactions to see if I have a stream
[20:33:02] *intake-analytics*
[20:34:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:34:14] urbanecm: it might have been a second thing actually. we're coordinating in -releng
[20:34:39] The instrument and config worked on beta, but there are enough config differences between beta and production that it's fully possible that my config isn't right.
[20:34:40] ack. i'll focus on the events question then.
[20:34:59] JSherman: afaik https://stream.wikimedia.org/v2/stream/mediawiki.special_diff_interactions is going to be a not found, as the stream is (presumably) not cleared for being streamed publicly.
[20:35:19] ah, well there you go
[20:37:06] do you need to do https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#EventStreams ?
[20:37:21] urbanecm: In that case, I don't have the ability to check that end, though I can still check to see if I'm posting data; it looks like I'm not.
[20:37:56] TheresNoTime: that'd allow you to view the events, but it looks like they're not sent out from the wiki
[20:38:54] JSherman: do you mind pointing me where the events should be sent from?
[20:39:47] like the instrumentation code?
[20:39:52] yes
[20:40:02] one moment...
[20:41:38] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaEvents/+/refs/heads/master/modules/ext.wikimediaEvents.specialPages/
[20:42:02] ty
[20:43:15] TheresNoTime: i think we can sync the patch anyway, it doesn't appear to be breaking anything, so there is no need to block the sync itself. once the issue is identified, we can deploy a follow-up. does that sound good?
[20:43:34] ack, sounds okay to me
[20:43:45] * TheresNoTime syncing
[20:44:36] (PS7) Jbond: puppet: switch to puppet7 command [software/puppet-compiler] - https://gerrit.wikimedia.org/r/936321 (https://phabricator.wikimedia.org/T236373)
[20:44:55] (CR) Jbond: "ready for review" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/936321 (https://phabricator.wikimedia.org/T236373) (owner: Jbond)
[20:45:34] urbanecm: TheresNoTime: I appreciate y'all's time on this; thank you both.
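The 20:28:33 message answers the "why does it run CA" question with a dependency chain: PageTriage depends on Echo, and Echo depends on CentralAuth. Below is a small sketch of how such a chain expands transitively. The `deps` map holds only the two edges named in the log, and `resolveDependencies` is a hypothetical helper for illustration; it is not how Wikimedia CI actually computes the extension list for a job.

```javascript
// Only the two edges mentioned in the log; the real dependency data
// lives in the CI configuration and is not reproduced here.
const deps = {
    PageTriage: ['Echo'],
    Echo: ['CentralAuth'],
};

// Hypothetical helper: collect every extension reachable from `name`
// by following dependency edges, excluding `name` itself.
function resolveDependencies(graph, name) {
    const seen = new Set();
    const stack = [...(graph[name] || [])];
    while (stack.length > 0) {
        const current = stack.pop();
        if (seen.has(current)) {
            continue; // already visited; also guards against cycles
        }
        seen.add(current);
        for (const next of graph[current] || []) {
            stack.push(next);
        }
    }
    seen.delete(name); // drop the starting point if a cycle led back to it
    return [...seen];
}

console.log(resolveDependencies(deps, 'PageTriage'));
```

With just those two edges, a PageTriage job ends up pulling in both Echo and CentralAuth, which matches what was seen in the failing console output.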
[20:46:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:53] (PS2) Samtar: ruwikibooks: Add NS104 to wgNamespacesToBeSearchedDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/937513 (https://phabricator.wikimedia.org/T341708) (owner: Stang)
[20:47:01] (PS1) Urbanecm: Fix mediawiki.special_diff_interactions configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/937527
[20:47:06] JSherman: TheresNoTime: this should fix it.
[20:47:28] (PS8) Jbond: puppet: switch to puppet7 command [software/puppet-compiler] - https://gerrit.wikimedia.org/r/936321 (https://phabricator.wikimedia.org/T236373)
[20:47:47] urbanecm: if JSherman OK's it, I'll deploy it next
[20:48:01] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[14,21,24].codfw.wmnet: Applying JVM update - eevans@cumin1001
[20:48:16] koi: might run a little late for your patch, are you okay to hang on?
[20:48:19] d'oh! TheresNoTime: yeah, that looks right to me!
[20:48:27] i'm here, it's ok
[20:48:51] (CR) Jsn.sherman: [C: +1] Fix mediawiki.special_diff_interactions configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/937527 (owner: Urbanecm)
[20:49:13] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:937096|log additional events on Special:Diff|MobileDiff (T326212)]] (duration: 26m 41s)
[20:49:16] T326212: Improve data logging on Special:Diff and Special:MobileDiff - https://phabricator.wikimedia.org/T326212
[20:49:20] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/937527 (owner: Urbanecm)
[20:50:03] (Merged) jenkins-bot: Fix mediawiki.special_diff_interactions configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/937527 (owner: Urbanecm)
[20:50:27] !log samtar@deploy1002 Started scap: Backport for [[gerrit:937527|Fix mediawiki.special_diff_interactions configuration]]
[20:50:27] JSherman: for the actual testing, i usually rely on `new mw.Api().saveOptions({'eventlogging-display-console': 1})`, which will log instrumentation events to your console for easy access. might come in handy :). but ensuring the http req went out should be fine as well.
[20:50:49] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[15-16,20,22,25].codfw.wmnet: Applying JVM update - eevans@cumin1001
[20:51:33] urbanecm: want to add that to some docs somewhere? (:
[20:51:45] that would have been so very useful the past... six months
[20:52:00] it appears to be documented already at https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#Registering_the_stream_with_EventLogging ?
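A minimal sketch of the console tip from 20:50:27, wrapped in a helper so it can be reused. `mw.Api` and its `saveOptions()` method belong to MediaWiki's JavaScript API, and the option name is taken from the message above. The helper name `enableEventLoggingConsole` is made up here for illustration, and its `api` argument is assumed to be an `mw.Api` instance or something with a compatible `saveOptions(options)` method.

```javascript
// Hypothetical helper (not part of MediaWiki): sets the user option
// that, per the tip in the log, makes instrumentation events show up
// in the browser console.
// `api` is assumed to be an mw.Api instance, or any object with a
// compatible saveOptions(options) method.
function enableEventLoggingConsole(api) {
    // Option name copied from the 20:50:27 message.
    return api.saveOptions({ 'eventlogging-display-console': 1 });
}

// In a browser console on the wiki:
//   enableEventLoggingConsole(new mw.Api());
```

Taking the API object as a parameter keeps the helper free of the `mw` global, so it can be exercised with a stub outside a wiki page.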
[20:52:03] !log samtar@deploy1002 samtar and urbanecm: Backport for [[gerrit:937527|Fix mediawiki.special_diff_interactions configuration]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:52:13] not sure if there's a better doc
[20:52:53] (:
[20:52:59] JSherman: live on mwdebug
[20:53:01] i can see an event incoming to mediawiki.special_diff_interactions
[20:53:20] (`specialDiff.click.prev_link` in particular)
[20:53:37] looking good!
[20:53:43] syncing
[20:53:50] TheresNoTime: oh, you mean the display console thing, right? not the cause of the schema not doing anything
[20:54:01] (yeah)
[20:54:06] okay, that i can document.
[20:54:08] (PS3) Samtar: ruwikibooks: Add NS104 to wgNamespacesToBeSearchedDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/937513 (https://phabricator.wikimedia.org/T341708) (owner: Stang)
[20:56:28] thanks for the bonus fix and testing tip, urbanecm:
[20:56:35] any time :)
[20:59:15] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:937527|Fix mediawiki.special_diff_interactions configuration]] (duration: 08m 47s)
[20:59:17] koi: ready
[20:59:25] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/937513 (https://phabricator.wikimedia.org/T341708) (owner: Stang)
[20:59:48] !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase20[15-16,20,22,25].codfw.wmnet: Applying JVM update - eevans@cumin1001
[21:00:15] (Merged) jenkins-bot: ruwikibooks: Add NS104 to wgNamespacesToBeSearchedDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/937513 (https://phabricator.wikimedia.org/T341708) (owner: Stang)
[21:00:39] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply
[21:00:43] !log samtar@deploy1002 Started scap: Backport for [[gerrit:937513|ruwikibooks: Add NS104 to wgNamespacesToBeSearchedDefault (T341708)]]
[21:00:44] Puppet, SRE, Infrastructure-Foundations, Puppet CI, and 2 others: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (jbond) @hashar is there a way we can control which worker a pcc job will run on based on the some param we pass to the `operations-puppe...
[21:00:46] T341708: Change default search options on Russian Wikibooks - https://phabricator.wikimedia.org/T341708
[21:00:52] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply
[21:02:13] !log samtar@deploy1002 stang and samtar: Backport for [[gerrit:937513|ruwikibooks: Add NS104 to wgNamespacesToBeSearchedDefault (T341708)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[21:02:14] koi: live on mwdebug
[21:02:20] looking
[21:02:30] (PS1) Eevans: cassandra: uninstall cassandra-twcs deployment repository [puppet] - https://gerrit.wikimedia.org/r/937528 (https://phabricator.wikimedia.org/T341732)
[21:02:52] (CR) CI reject: [V: -1] cassandra: uninstall cassandra-twcs deployment repository [puppet] - https://gerrit.wikimedia.org/r/937528 (https://phabricator.wikimedia.org/T341732) (owner: Eevans)
[21:03:26] TheresNoTime, I checked Special:Search and it works as expected
[21:03:32] ack, syncing
[21:08:54] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:937513|ruwikibooks: Add NS104 to wgNamespacesToBeSearchedDefault (T341708)]] (duration: 08m 10s)
[21:08:57] T341708: Change default search options on Russian Wikibooks - https://phabricator.wikimedia.org/T341708
[21:09:01] all deployed :)
[21:09:01] (PS1) Jbond: pcc: add support for GERRIT_PRIVATE_CHANGE_NUMBER [puppet] - https://gerrit.wikimedia.org/r/937530 (https://phabricator.wikimedia.org/T265633)
[21:09:17] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[16,20,22,25].codfw.wmnet: Applying JVM update - eevans@cumin1001
[21:09:19] (CR) CI reject: [V: -1] pcc: add support for GERRIT_PRIVATE_CHANGE_NUMBER [puppet] - https://gerrit.wikimedia.org/r/937530 (https://phabricator.wikimedia.org/T265633) (owner: Jbond)
[21:09:21] ty
[21:09:21] !log close UTC late backport window
[21:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:44] (PS2) Eevans: cassandra: uninstall cassandra-twcs deployment repository [puppet] - https://gerrit.wikimedia.org/r/937528 (https://phabricator.wikimedia.org/T341732)
[21:13:03] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/937528 (https://phabricator.wikimedia.org/T341732) (owner: Eevans)
[21:17:14] (CR) Bking: [C: +2] wdqs.data-transfer: Add more pool options [cookbooks] - https://gerrit.wikimedia.org/r/934602 (https://phabricator.wikimedia.org/T340793) (owner: Bking)
[21:19:42] (CR) Bking: [C: +2] wdqs.data-transfer: Add more pool options (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/934602 (https://phabricator.wikimedia.org/T340793) (owner: Bking)
[21:19:54] (Merged) jenkins-bot: wdqs.data-transfer: Add more pool options [cookbooks] - https://gerrit.wikimedia.org/r/934602 (https://phabricator.wikimedia.org/T340793) (owner: Bking)
[21:25:15] (CR) Jforrester: "+ Single Edit Tab" [mediawiki-config] - https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: Jforrester)
[21:28:15] (CR) DVrandecic: "As discussed: in order to switch from "Edit source" to "Edit" on Object pages, we could configure the single-editor version of Visual Edit" [mediawiki-config] - https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: Jforrester)
[21:30:13] (PS1) Jbond: pcc: update the parse commit method to support "Change private:" footer [puppet] - https://gerrit.wikimedia.org/r/937534 (https://phabricator.wikimedia.org/T265633)
[21:32:09] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply
[21:32:19] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply
[21:33:24] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[16,20,22,25].codfw.wmnet: Applying JVM update - eevans@cumin1001
[21:41:32] (PS2) Jbond: pcc: add support for GERRIT_PRIVATE_CHANGE_NUMBER [puppet] - https://gerrit.wikimedia.org/r/937530 (https://phabricator.wikimedia.org/T265633)
[21:41:34] (PS2) Jbond: pcc: update the parse commit method to support "Change private:" footer [puppet] - https://gerrit.wikimedia.org/r/937534 (https://phabricator.wikimedia.org/T265633)
[21:43:56] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[12,17,18,23,26,27].codfw.wmnet: Applying JVM update - eevans@cumin1001
[21:48:17] (PS1) Bking: wdqs.data-transfer: Keep downtime [cookbooks] - https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T332314)
[21:49:04] (CR) Bking: "This change is ready for review." [cookbooks] - https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T332314) (owner: Bking)
[21:49:34] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:50:38] (CR) CI reject: [V: -1] wdqs.data-transfer: Keep downtime [cookbooks] - https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T332314) (owner: Bking)
[22:07:15] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:10:20] (PS1) Jdlrobson: Deploy new logos [mediawiki-config] - https://gerrit.wikimedia.org/r/937480
[22:11:49] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:18:21] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:18:55] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[12,17,18,23,26,27].codfw.wmnet: Applying JVM update - eevans@cumin1001
[22:23:43] (PS3) Jbond: pcc: update the parse commit method to support "Change-Private:" footer [puppet] - https://gerrit.wikimedia.org/r/937534 (https://phabricator.wikimedia.org/T265633)
[22:25:00] (PS1) Superpes15: [mywiki] Create 'autopatrolled' and 'patroller' usergroups [mediawiki-config] - https://gerrit.wikimedia.org/r/937540 (https://phabricator.wikimedia.org/T341026)
[22:25:42] (PS2) Jdlrobson: Deploy new logos [mediawiki-config] - https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260)
[22:27:30] (PS3) Jdlrobson: Deploy new logos [mediawiki-config] - https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260)
[22:30:48] (PS4) Jdlrobson: Deploy new logos [mediawiki-config] - https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260)
[22:33:21] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:42:54] (PS4) Jbond: pcc: update the parse commit method to support "Change-Private:" footer [puppet] - https://gerrit.wikimedia.org/r/937534 (https://phabricator.wikimedia.org/T265633)
[22:48:54] (PS1) Arlolra: Set default for UseLegacyMediaStyles and disable on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/937544 (https://phabricator.wikimedia.org/T318433)
[22:57:15] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 99, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:07:47] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[16,19,20,21,28,31].eqiad.wmnet: Applying JVM update - eevans@cumin1001
[23:12:41] (PS2) Jbond: rsyslog: update to use pki certificates [puppet] - https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741)
[23:16:45] SRE, SRE-Access-Requests: Requesting access to release-engineering for aklapper - https://phabricator.wikimedia.org/T341749 (thcipriani)
[23:19:15] (PS2) Jbond: rsyslog::receiver: update docs and add types [puppet] - https://gerrit.wikimedia.org/r/936762 (https://phabricator.wikimedia.org/T340741)
[23:19:17] (PS3) Jbond: rsyslog: update to use pki certificates [puppet] - https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741)
[23:20:23] SRE, SRE-Access-Requests, Release-Engineering-Team (Radar): Requesting access to release-engineering for aklapper - https://phabricator.wikimedia.org/T341749 (thcipriani)
[23:25:33] (CR) Jbond: rsyslog: update to use pki certificates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[23:27:54] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:28:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:32:07] (CR) Jbond: [C: +1] "lgtm" [software/pywmflib] - https://gerrit.wikimedia.org/r/937499 (owner: Volans)
[23:33:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:47:36] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[16,19,20,21,28,31].eqiad.wmnet: Applying JVM update - eevans@cumin1001
[23:48:31] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[23:57:41] (CR) Jbond: roll-restart-wikimedia-dns: Add reboot action (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/937173 (owner: BCornwall)