[00:15:07] 10Beta-Cluster-Infrastructure, 10Citoid: Update citoid service on beta - https://phabricator.wikimedia.org/T380165#10798561 (10bd808) @Mvolz The hints from @thcipriani are leading in the right direction, but probably not super clear. I think the explicit steps needed are: * Determine the exact tag from https:/... [00:25:54] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Seen): Provide a version of frwiki on Beta Cluster / staging - https://phabricator.wikimedia.org/T166290#10798588 (10bd808) https://fr.wikipedia.beta.wmflabs.org/wiki/Accueil exists and seems to have the expected French UI language. It does not seem to... [00:28:13] 10Beta-Cluster-Infrastructure, 06Growth-Team, 06Growth-Team-Filtering, 10StructuredDiscussions, 07Technical-Debt: Use dedicated $wgFlowCluster and $wgFlowDefaultWikiDb on Beta Cluster - https://phabricator.wikimedia.org/T147523#10798590 (10bd808) 05Open→03Declined Structured discussions is being... [04:43:04] 10MediaWiki-Releasing, 06translatewiki.net, 05MW-1.44-release, 13Patch-For-Review: Configure Translatewiki.net for REL1_44 - https://phabricator.wikimedia.org/T393514#10798731 (10Nikerabbit) [05:51:01] (03update) 10brennen: Draft: SpiderPig log view [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/767 (https://phabricator.wikimedia.org/T391005) [06:41:08] !log integration: bring back integration-agent-docker-1062 , I had it disconnected on April 30 at 6:30am UTC to clean /srv/jenkins/workspace and apparently forgot to put it back online [06:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [06:50:22] 10Continuous-Integration-Infrastructure: integration-agent-docker-1061 is offline - https://phabricator.wikimedia.org/T393542 (10hashar) 03NEW [06:54:33] 10Continuous-Integration-Infrastructure, 07Jenkins: Recently created Jenkins agents use the wrong credential (jenkins-deploy-toolforge) - https://phabricator.wikimedia.org/T393543 
(10hashar) 03NEW [06:57:53] 10Gerrit, 06collaboration-services: Gerrit is unresponsive (2025-05-06) - https://phabricator.wikimedia.org/T393498#10798920 (10Jelto) [06:57:56] !log Added label `blubber` and `pipelinelib` to integration-agent-docker-1060 integration-agent-docker-1061 and integration-agent-docker-1062 # T393543 [06:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [06:57:58] T393543: Recently created Jenkins agents use the wrong credential (jenkins-deploy-toolforge) - https://phabricator.wikimedia.org/T393543 [06:58:13] 06Release-Engineering-Team, 06collaboration-services: ProbeDown (gerrit1003) - https://phabricator.wikimedia.org/T393470#10798923 (10Jelto) [06:58:23] 06Release-Engineering-Team, 06collaboration-services: ProbeDown (gerrit1003) - https://phabricator.wikimedia.org/T393470#10798924 (10Jelto) [06:58:42] !log Change ssh credentials for integration-agent-docker-1060 integration-agent-docker-1061 and integration-agent-docker-1062 to `key to connect to labs instances set up with role::ci::slave::labs::common` # T393543 [06:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [06:59:00] 06Release-Engineering-Team, 06collaboration-services: ProbeDown (gerrit1003) - https://phabricator.wikimedia.org/T393470#10798928 (10Jelto) 05Open→03Resolved a:03Jelto Alert recovered, I'll resolve this one because there are multiple other open task for this issue [06:59:45] 06Release-Engineering-Team, 06collaboration-services: ProbeDown (gerrit1003) - https://phabricator.wikimedia.org/T393460#10798934 (10Jelto) [07:00:06] 06Release-Engineering-Team, 06collaboration-services: ProbeDown (gerrit1003) - https://phabricator.wikimedia.org/T393460#10798937 (10Jelto) 05Open→03Resolved a:03Jelto Alert recovered, I'll resolve this one because there are multiple other open task for this issue [07:00:52] 10Continuous-Integration-Infrastructure, 07Jenkins: Recently created Jenkins agents use the 
wrong credential (jenkins-deploy-toolforge) - https://phabricator.wikimedia.org/T393543#10798941 (10hashar) 05Open→03Resolved a:03hashar Fixed by changing the credential being used and I have added the missing... [07:01:30] 10Continuous-Integration-Infrastructure: integration-agent-docker-1061 is offline - https://phabricator.wikimedia.org/T393542#10798944 (10hashar) [07:03:12] !log Hard rebooted integration-agent-docker-1061 via Horizon, the instance is not reachable by ssh and looks bricked # T393542 [07:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [07:03:15] T393542: integration-agent-docker-1061 is offline - https://phabricator.wikimedia.org/T393542 [07:06:25] 10Continuous-Integration-Infrastructure: integration-agent-docker-1061 is offline - https://phabricator.wikimedia.org/T393542#10798950 (10hashar) 05Open→03Resolved a:03hashar I have rebooted the instance and it might have been running just fine since Puppet did ran before the reboot. From the console I... [07:11:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:11:39] 06Release-Engineering-Team, 06collaboration-services: ProbeDown - https://phabricator.wikimedia.org/T393544 (10phaultfinder) 03NEW [07:16:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:19:14] 10Gerrit, 06collaboration-services: Gerrit is unresponsive (2025-05-06) - https://phabricator.wikimedia.org/T393498#10798964 (10Jelto) This happened again (T393544). 
But the current pattern looks a bit different then the usual scraping: I can see a spike of traffic from within WMCS (`integration-agent-docker-... [07:19:35] 06Release-Engineering-Team, 06collaboration-services: ProbeDown (gerrit1003) - https://phabricator.wikimedia.org/T393544#10798966 (10Jelto) [07:19:45] 06Release-Engineering-Team, 06collaboration-services: ProbeDown (gerrit1003) - https://phabricator.wikimedia.org/T393544#10798967 (10Jelto) [07:43:21] 10MediaWiki-Releasing, 10MediaWiki-Docker, 13Patch-For-Review: Remove docker-compose.yml from MediaWiki archive releases - https://phabricator.wikimedia.org/T393183#10799020 (10Douginamug) 05Open→03In progress p:05Triage→03Low [07:58:22] 10Beta-Cluster-Infrastructure, 07Epic: 2025 tracking task for Beta Cluster (deployment-prep) traffic overload protection (blocking unwanted crawlers) - https://phabricator.wikimedia.org/T393487#10799053 (10taavi) [07:58:24] 10Beta-Cluster-Infrastructure: Beta cluster IP block page should not point to noc@wikimedia.org - https://phabricator.wikimedia.org/T393404#10799054 (10taavi) [08:01:08] (03CR) 10Hashar: [C:04-1] Remove quotes from env var value in docs (031 comment) [integration/quibble] - 10https://gerrit.wikimedia.org/r/1141863 (owner: 10Jakob) [08:10:51] jakob_WMDE: I am not so sure about removing quotes in `PHP_IDE_CONFIG="serverName=Local Server"` [08:11:13] that would assign `serverName=Local` to the variable and attempt to run `Server`? [08:11:50] and I don't get why the double quotes would be included in the assignments. But I must be missing something since it is surely broken for you :] [08:12:41] (03PS1) 10Hashar: dockerfiles: update Quibble to 1.14.1 [integration/config] - 10https://gerrit.wikimedia.org/r/1142976 [08:13:26] hashar: hehe, I was just trying that out again [08:13:42] maybe cause I read it as intended for a shell/bash [08:13:57] when it is processed by `docker run --env-file` which behaves differently? 
[08:14:03] to answer your comment on gerrit: I'm on Linux, and I've now checked it again and it definitely treats quotes as part of the value [08:14:12] and it works fine with spaces [08:14:28] so yeah, I guess `--env-file` is different from bash variable assignments in that regard [08:15:08] I should have tested it :b [08:15:17] $ podman run --env-file=env --rm -it --entrypoint=bash docker-registry.wikimedia.org/bullseye [08:15:17] root@70d4c9dff7e5:/# printf "foobar=%s\n" "$foobar" [08:15:17] foobar=this has spaces [08:15:25] tadaa [08:15:28] :D [08:16:12] root@a9005b705dd5:/# printf "foobar=%s\n" "$foobar" [08:16:12] foobar="docker is broken" [08:16:14] oh joy [08:16:42] yuppp :/ [08:16:58] since --env-file input supports comments (using # as a prefix) [08:18:05] may you amend your patch to mention something above the variable assignment to mention --env-file does no need quotes to prevent word splitting? [08:18:09] something like that :) [08:18:17] that is for the future selves [08:18:34] when I know i will one day bring back the double quotes cause that feels SO wrong :b [08:19:00] hehe sure, I'll amend the patch [08:19:08] (03CR) 10Hashar: [C:03+2] dockerfiles: update Quibble to 1.14.1 [integration/config] - 10https://gerrit.wikimedia.org/r/1142976 (owner: 10Hashar) [08:20:24] (03PS1) 10Hashar: jjb: switch jobs to Quibble 1.14.1 [integration/config] - 10https://gerrit.wikimedia.org/r/1142981 [08:20:46] (03Merged) 10jenkins-bot: dockerfiles: update Quibble to 1.14.1 [integration/config] - 10https://gerrit.wikimedia.org/r/1142976 (owner: 10Hashar) [08:20:57] and I am pushing to prod your patch to have QUIBBLE_OPENSEARCH env variable passed to the CirrusSearch maintenance script [08:22:30] nice, thanks! 
[08:23:18] jakob_WMDE: and if the change is only for documentation, the commit message first line can be prefixed with `doc: ` [08:23:25] that makes it stand out as a a doc only change :) [08:23:28] but I am picky [08:24:06] ref: https://en.wikipedia.org/wiki/Nitpicking :^\ [08:27:17] hashar: btw, I almost put the comment on the same line as the assignment in the env file... but that also becomes part of the value :| [08:27:35] ah yeah of course! [08:28:27] I thought about something like: [08:28:27] # Below as a space and --env-file does not need quotes to prevent splitting [08:28:27] MAGIC=accepts spaces in assignment [08:28:39] aka add the comment on a standalone line? [08:29:02] (03PS3) 10Jakob: doc: Remove quotes from env var value [integration/quibble] - 10https://gerrit.wikimedia.org/r/1141863 [08:29:24] yup, already done :D [08:30:59] I'm pretty sure I'll trip over both quoted values and comments in values in env files again one day :') [08:32:19] (03CR) 10Hashar: [C:03+2] doc: Remove quotes from env var value (031 comment) [integration/quibble] - 10https://gerrit.wikimedia.org/r/1141863 (owner: 10Jakob) [08:32:23] yeah looks great thanks! [08:32:37] I guess my confusion was due to: [08:32:42] a) writing lot of shell scripts [08:32:47] with shellcheck [08:33:11] b) passing env variables from the shell command line using eg: `docker run -e foo="Local server"` [08:33:22] which does require double quotes cause that is the shell processing the command [08:34:12] yup, understandable! [08:47:28] 06Project-Admins, 07Tracking-Neverending: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#10799327 (10OWresch-WMF) Hi! I recently joined as EM for MediaWiki Interfaces and I'd like to have permission to create Milestones for the team. Thanks! 
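Editor's sketch of the `--env-file` behaviour discussed above (08:10 to 08:34). This is a simplified model in Python, not Docker's actual parser: the function names `parse_env_file_line` and `parse_shell_assignment` are hypothetical, and the only rules modelled are the ones reported in the conversation (quotes are kept as part of the value, spaces need no quoting, a `#` comment only works on its own line). The shell side uses the standard library's `shlex.split` as a stand-in for shell word splitting.

```python
import shlex


def parse_env_file_line(line):
    """Simplified model of an --env-file line, as described in the chat.

    Returns (name, value), or None for blank lines and standalone comments.
    The value is kept verbatim: quotes and trailing '# ...' text stay in it.
    """
    text = line.rstrip("\n")
    if not text.strip() or text.lstrip().startswith("#"):
        return None
    name, sep, value = text.partition("=")
    if not sep:
        # Lines without '=' are out of scope for this sketch.
        return None
    return name, value


def parse_shell_assignment(line):
    """Approximate how a POSIX shell would read the same text.

    Returns (name, value, extra_words). Quotes are removed, and unquoted
    spaces split the line into further words (which the shell would try
    to run as a command).
    """
    words = shlex.split(line)
    name, _, value = words[0].partition("=")
    return name, value, words[1:]


quoted = 'PHP_IDE_CONFIG="serverName=Local Server"'
unquoted = "PHP_IDE_CONFIG=serverName=Local Server"

print(parse_env_file_line(quoted))      # quotes end up inside the value
print(parse_env_file_line(unquoted))    # spaces are fine without quotes
print(parse_shell_assignment(quoted))   # the shell strips the quotes
print(parse_shell_assignment(unquoted)) # the shell splits off "Server"
print(parse_env_file_line("MAGIC=accepts spaces # not a comment"))
```

This also matches the remark about `docker run -e foo="Local server"`: there the shell processes the command line first, so the quotes are gone before the value reaches the container runtime.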
[08:50:11] (03CR) 10CI reject: [V:04-1] doc: Remove quotes from env var value [integration/quibble] - 10https://gerrit.wikimedia.org/r/1141863 (owner: 10Jakob) [08:51:41] pff [08:52:04] !log Updating Jenkins jobs to Quibble 1.14.1 [08:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [08:52:40] 00:16:31.581 5) MediaWiki\CheckUser\Tests\Integration\GlobalContributions\SpecialGlobalContributionsTest::testExecutePreference [08:52:41] 00:16:31.581 Failed asserting that 1 is identical to 2. [08:52:41] :) [08:53:16] 06Project-Admins, 07Tracking-Neverending: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#10799341 (10Ladsgroup) >>! In T706#10799327, @OWresch-WMF wrote: > Hi! I recently joined as EM for MediaWiki Interfaces and I'd like to have permission to crea... [08:54:16] (03CR) 10Hashar: [C:03+2] "Trying again, CI failed due to some issue in CheckUser which might be transient." [integration/quibble] - 10https://gerrit.wikimedia.org/r/1141863 (owner: 10Jakob) [09:04:37] (03CR) 10Hashar: [C:03+2] "I have updated the Jenkins jobs to run Quibble 1.14.1" [integration/config] - 10https://gerrit.wikimedia.org/r/1142981 (owner: 10Hashar) [09:05:55] (03Merged) 10jenkins-bot: jjb: switch jobs to Quibble 1.14.1 [integration/config] - 10https://gerrit.wikimedia.org/r/1142981 (owner: 10Hashar) [09:13:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:13:35] 06Release-Engineering-Team, 06collaboration-services: ProbeDown - https://phabricator.wikimedia.org/T393563 (10phaultfinder) 03NEW [09:15:53] (03CR) 10CI reject: [V:04-1] doc: Remove quotes from env var value [integration/quibble] - 
10https://gerrit.wikimedia.org/r/1141863 (owner: 10Jakob) [09:18:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:06:00] 10Scap (SpiderPig 🕸️), 06Infrastructure-Foundations: Add deployment group users to spiderpig-access ldap - https://phabricator.wikimedia.org/T392958#10799489 (10SLyngshede-WMF) Seems reasonable yes, they've all been added, see: https://ldap.toolforge.org/group/spiderpig-access [10:06:07] 10Scap (SpiderPig 🕸️), 06Infrastructure-Foundations: Add deployment group users to spiderpig-access ldap - https://phabricator.wikimedia.org/T392958#10799490 (10SLyngshede-WMF) 05Open→03Resolved [10:30:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:35:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:54:38] Hey it looks like the most recent gerrit alerts correlate with CI jobs from WMCS, the traffic all comes from integration-agent-docker machines. 
See my comment in https://phabricator.wikimedia.org/T393498#10798962 [10:54:38] Has there been any significant change to the pipelines recently? Is there a way to reduce concurrency or the number of threads which are opened when pulling changes? [10:54:38] It looks like CI ist DoSing Gerrit at the moment [11:03:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:03:50] 06Release-Engineering-Team, 06collaboration-services: ProbeDown - https://phabricator.wikimedia.org/T393563#10799734 (10phaultfinder) [11:08:31] RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:41] 06Release-Engineering-Team, 06collaboration-services: ProbeDown - https://phabricator.wikimedia.org/T393563#10799738 (10phaultfinder) [11:15:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:28] 10Continuous-Integration-Config, 10Math, 10MediaWiki-Platform-Team (Radar), 10MW-1.44-notes (1.44.0-wmf.27; 
2025-04-29), 13Patch-For-Review: Allow control over which extra extensions are installed (Math REL1_43 jobs exceed 60min timeout) - https://phabricator.wikimedia.org/T389998#10799913 (10Daimona) Dr... [11:53:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:01:28] 10Gerrit, 06collaboration-services: Gerrit is unresponsive (2025-05-06) - https://phabricator.wikimedia.org/T393498#10799987 (10Jelto) In the past hours the `ProbeDown: Service gerrit1003` alert fired dozens of times. I checked a few of those incidents but could not find anything obvious beside a lot of WMCS m... [12:06:27] That's odd because CI doesn't seem to be particularly busy [12:07:04] I see a bunch of "CANCELED" jobs though and I'm not sure what that means [12:07:26] A new patchset I guess? [12:09:20] that is zuul killing a set of jobs because a change ahead in the queue has failed [12:09:34] so if you get a queue master <-- A <-- B [12:09:54] when a job for A failed, then all jobs made for B are cancelled [12:10:13] A will be dropped from the queue to form: master <-- B [12:10:18] and new jobs get triggered [12:10:33] why are cancelled jobs showing in the UI now when they were barely seen previously, I have no idea [12:11:12] those are in the test pipeline though [12:12:06] Like https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/13740/. But I imagine a new patchset would do it. 
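Editor's sketch of the queue behaviour explained above (12:09 to 12:10): with a queue `master <-- A <-- B`, a failure for A cancels the jobs for B, A is dropped, and new jobs are triggered for B against the shortened queue. This is a toy model in Python, not Zuul code; `handle_failure` and its return shape are hypothetical and only illustrate the bookkeeping described in the chat.

```python
def handle_failure(queue, failed):
    """Toy model of a failing change in a dependent queue.

    `queue` lists changes from the head (closest to master) to the tail.
    Returns (new_queue, cancelled, retriggered):
      - new_queue: the queue with the failed change dropped
      - cancelled: changes behind the failed one, whose jobs are cancelled
      - retriggered: the same changes, which get new jobs on the new queue
    Changes ahead of the failed one are left alone.
    """
    index = queue.index(failed)
    ahead = queue[:index]
    behind = queue[index + 1:]
    new_queue = ahead + behind
    return new_queue, list(behind), list(behind)


# master <-- A <-- B, and a job for A fails:
new_queue, cancelled, retriggered = handle_failure(["A", "B"], "A")
print("queue is now:", new_queue)
print("jobs cancelled for:", cancelled)
print("new jobs triggered for:", retriggered)
```

In this model every change behind a failure shows up once as cancelled and once as retriggered, which would be consistent with a burst of "CANCELED" entries followed by fresh job runs.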
[12:13:05] Oh but the chain that was just submitted won't help. [12:13:34] That's gonna make zuul cloner go berserk, no? And that might be what's causing gerrit instability [12:13:54] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1143071 [12:14:01] oh another reason is when a new patchset is sent [12:14:22] so that job for 1140705,20 got cancelled most probably because 1140705 received a new patchset [12:14:30] so Zuul cancels all the jobs for patchset 20 cause it is no more relevant [12:15:18] Right, so that makes sense. [12:15:46] But now for the gerrit outage, I imagine the large-scale-change above could do it given the dependencies? [12:19:32] 10Gerrit, 06collaboration-services: Gerrit is unresponsive (2025-05-06) - https://phabricator.wikimedia.org/T393498#10800045 (10Daimona) >>! In T393498#10799987, @Jelto wrote: > In the past hours the `ProbeDown: Service gerrit1003` alert fired dozens of times. I checked a few of those incidents but could not f... [12:19:44] gerrit is down? :) [12:19:45] :/ [12:20:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:35] 06Release-Engineering-Team, 06collaboration-services: ProbeDown - https://phabricator.wikimedia.org/T393563#10800046 (10phaultfinder) [12:20:53] Not fully down, but unresponsive ^^ [12:21:10] yes I mentioned this a bit earlier and its alerting constantly [12:21:34] I'm checking to see if the timeline of that core change aligns with the alerts [12:21:57] The vast majority of those changes was submitted at 10:22 UTC [12:22:28] it looks like the CI peaks are just the tipping point when gerrit fails. 
The root cause may be just high traffic which 4x in the last 10 weeks [12:22:52] Then a new patchset for each around 10:50-11:05 [12:23:19] And a bunch of rechecks triggered 10 minutes ago (12:23) [12:23:57] And yeah, it would make sense that this is just the straw breaking etc [12:25:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:38:01] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:38:05] 06Release-Engineering-Team, 06collaboration-services: ProbeDown - https://phabricator.wikimedia.org/T393563#10800137 (10phaultfinder) [12:43:01] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:35] 06Release-Engineering-Team, 06collaboration-services: ProbeDown - https://phabricator.wikimedia.org/T393563#10800436 (10phaultfinder) [13:56:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:56:44] 06Release-Engineering-Team, 06collaboration-services: ProbeDown - https://phabricator.wikimedia.org/T393563#10800449 (10phaultfinder) [14:01:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:06:43] 06Release-Engineering-Team, 10Scap, 10Dumps-Generation: scap needs to be k8s-cluster aware - https://phabricator.wikimedia.org/T388761#10800517 (10BTullis) >>! In T388761#10784839, @Scott_French wrote: > IMO, I think it makes sense to standardize on either "clusters" (i.e., in the k8s sense) or "environments... [14:27:17] 10Phabricator: Custom task form for #MW-Interfaces-Team - https://phabricator.wikimedia.org/T392598#10800619 (10MBinder_WMF) @OWresch-WMF fair enough! I wonder if it might be easiest to start with task count. E.g., "how many tasks were committed to and not finished?" Then simply reduce the next sprint by that ma... [15:25:48] 10Release-Engineering-Team (Doing 😎), 10Scap (SpiderPig 🕸️), 10Codex, 07Epic: [EPIC] scap web interface: Create SpiderPig web UI - https://phabricator.wikimedia.org/T375782#10800869 (10dancy) 05In progress→03Resolved Considering this done. 
[15:26:25] 10Beta-Cluster-Infrastructure, 07Epic: 2025 tracking task for Beta Cluster (deployment-prep) traffic overload protection (blocking unwanted crawlers) - https://phabricator.wikimedia.org/T393487#10800871 (10bd808) [15:26:26] 10Beta-Cluster-Infrastructure: High load on deployment-mediawiki14 and slow responses - https://phabricator.wikimedia.org/T392534#10800872 (10bd808) [16:33:37] Looks like CI is being flaky again. A couple of my patches have failed to merge since arlo's +2 yday and 3 retries since. is this known? Should I file a phab task? See https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/1141191 [16:37:19] subbu: task welcome against #ci-test-error [16:39:49] subbu: maybe that is https://phabricator.wikimedia.org/T393531 [16:40:25] which mentions parsoid and is apparnetly fixed/worked aroudn by https://gerrit.wikimedia.org/r/1143078 [16:41:07] ah .. okay .. let me retry! thx. [16:45:14] subbu: looking at your patch: the first failure was T389536 which nobody knows why is happening. Second failure was the task linked above, a semi-legit failure in an individual CheckUser test that has since been fixed. Third and last failure (QUnit's "Disconnected during run") I'm not sure, I definitely saw it happen a few times and reported in T388416 back in March. My guess is that CI sometimes suffers from latency spikes (unsure [16:45:14] if CPU or IO), and when they happen during a JavaScript-ish test (QUnit, selenium, api-testing) they can trigger a timeout. [16:45:14] T389536: Selenium timeouts can cause the job to remain stuck until the build times out - https://phabricator.wikimedia.org/T389536 [16:45:15] T388416: CI jobs failing with various timeouts (March 2025) - https://phabricator.wikimedia.org/T388416 [16:45:46] The tricky bit is that it's not clear whether and how these three issues are related. The second is surely unrelated, but no idea about the other two. [16:47:01] ya ... 
i am used to random selenium retries and stuck builds by now ... and i simply retry them ... but usually, they resolve after one or two retries. in this case, it seemed different with different jobs failing each time. anyway, i retried it again now. [16:53:35] I think CI is undeniable less stable lately. I'm not sure how, when, or why it started, but especially browser tests seem way more brittle than before. [16:54:05] When I filed the "March 2025" task above I still thought it might have been just an impression, but time confirmed that it wasn't [16:59:56] yes, more flaky lately .. but looks like that first patch merged now. Hopefully once we fully migrate out to 7.4 to 8.1 .. we will have fewer CI jobs to run and hence reduced flakiness at least. [17:01:19] 10Continuous-Integration-Infrastructure, 07ci-test-error: QUnit error: "Disconnected during run, waiting 2000ms for reconnecting." in ve.ce.Document - https://phabricator.wikimedia.org/T393624 (10Daimona) 03NEW [17:01:39] I hope so. For the time being, I filed ^ task [17:01:50] seems like a couple of problems: the march 2025 problem may have been a longstanding bug that's surfaced due to higher demand on servers (in part due to running tests for three php versions in parallel), teams experimenting with cypress, and a move to webdriver v8 in progress: lots of opportunity for instability at the moment [17:02:44] 10Continuous-Integration-Infrastructure, 07ci-test-error: QUnit error: "Disconnected during run, waiting 2000ms for reconnecting." in ve.ce.Document - https://phabricator.wikimedia.org/T393624#10801482 (10Daimona) [17:02:50] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10Testing Support, and 3 others: CI jobs failing with various timeouts (March 2025) - https://phabricator.wikimedia.org/T388416#10801483 (10Daimona) [17:03:03] It definitely feels like this started around the time we began testing stuff with PHP 8.1, yes. 
[17:05:51] Checking the first and third examples I reported in T388416, both wmf-quibble-selenium-php81, they're for builds #902 and #1320 respectively and just one day apart, so that basically matches the date when we introduced the new job [17:05:52] T388416: CI jobs failing with various timeouts (March 2025) - https://phabricator.wikimedia.org/T388416 [17:49:35] 06Gerrit-Privilege-Requests, 10WDLexAudio: Request for plus 2 rights - https://phabricator.wikimedia.org/T391414#10801708 (10Collins) >>! In T391414#10724997, @taavi wrote: > The relevant [[ https://gerrit.wikimedia.org/r/admin/groups/5a653dd39f435e24b659cb1ceedc873b34abffa0 | Gerrit group ]] is self-owned so... [18:29:43] (03update) 10brennen: Draft: SpiderPig log view [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/767 (https://phabricator.wikimedia.org/T391005) [18:32:49] 10Phabricator (Search): Dated elections no longer top results in search preview on en.wikipedia - https://phabricator.wikimedia.org/T393635 (10Thesavagenorwegian) 03NEW [18:42:54] 10GitLab (Account Approval), 06Release-Engineering-Team: Requesting GitLab account activation for Bmartinezcalvo - https://phabricator.wikimedia.org/T393385#10801965 (10Pppery) 05Open→03Resolved Closing as resolved as the account https://gitlab.wikimedia.org/bmartinezcalvo is now approved, presumably b... 
[19:06:32] (03update) 10brennen: Draft: SpiderPig log view [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/767 (https://phabricator.wikimedia.org/T391005) [19:11:41] (03update) 10brennen: Draft: SpiderPig log view [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/767 (https://phabricator.wikimedia.org/T391005) [19:37:15] (03update) 10brennen: Draft: SpiderPig log view [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/767 (https://phabricator.wikimedia.org/T391005) [19:37:56] (03update) 10brennen: SpiderPig log view [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/767 (https://phabricator.wikimedia.org/T391005) [19:41:15] (03update) 10brennen: SpiderPig log view [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/767 (https://phabricator.wikimedia.org/T391005) [19:42:34] (03open) 10dancy: spiderpig: api.py: Disable openapi URL [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/792 [19:42:37] (03update) 10dancy: spiderpig: api.py: Disable openapi URL [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/792 [19:43:41] 10Release-Engineering-Team (Yak Shaving 🐃🪒), 10Scap (SpiderPig 🕸️), 13Patch-For-Review: Automatically select first backport search match - https://phabricator.wikimedia.org/T392508#10802167 (10bd808) 05In progress→03Resolved Let's call this one complete. I will fork a lower priority task to track the... 
[19:44:08] (03update) 10dancy: spiderpig: api.py: Disable openapi URL [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/792 [19:45:43] (03update) 10brennen: SpiderPig log view [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/767 (https://phabricator.wikimedia.org/T391005) [19:46:46] 10Scap (SpiderPig 🕸️): SpiderPig UI: Add titles of backport patches to UI - https://phabricator.wikimedia.org/T393128#10802185 (10bd808) @jeena Are you thinking that the "chips" that get added when you select from the autocomplete should be the full titles, or would you want to show this information somewhere el... [19:47:01] (03update) 10brennen: SpiderPig log view [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/767 (https://phabricator.wikimedia.org/T391005) [19:49:36] 10Scap (SpiderPig 🕸️): SpiderPig UI: Add titles of backport patches to UI - https://phabricator.wikimedia.org/T393128#10802203 (10jeena) >>! In T393128#10802185, @bd808 wrote: > @jeena Are you thinking that the "chips" that get added when you select from the autocomplete should be the full titles, or would you w... [19:51:50] 10Scap (SpiderPig 🕸️): SpiderPig UI: Add titles of backport patches to UI - https://phabricator.wikimedia.org/T393128#10802205 (10bd808) > easily recognize what kind of changes are going out In that vein it might be nice to add an icon to differentiate between config changes and code backports too. 
[19:53:06] Release-Engineering-Team (Yak Shaving 🐃🪒), Scap (SpiderPig 🕸️): Integrate SpiderPig with the Deployment calendar - https://phabricator.wikimedia.org/T392507#10802222 (bd808)
[20:00:49] Scap (SpiderPig 🕸️): Automatically select matched changes triggered by a space or comma in the change number input widget - https://phabricator.wikimedia.org/T393644 (bd808) NEW
[20:01:04] Scap (SpiderPig 🕸️): Automatically select matched changes triggered by a space or comma in the change number input widget - https://phabricator.wikimedia.org/T393644#10802259 (bd808) p:Triage→Low
[20:50:26] Release-Engineering-Team (Yak Shaving 🐃🪒), Scap (SpiderPig 🕸️): Integrate SpiderPig with the Deployment calendar - https://phabricator.wikimedia.org/T392507#10802447 (bd808) * https://wikitech.wikimedia.org/w/index.php?title=Module%3AGerrit&diff=2299579&oldid=2148302 * https://wikitech.wikimedia.org/w/in...
[20:51:01] (merge) annet: releases: Bump Codex to 2.0.0-rc.1 [repos/ci-tools/libup-config] - https://gitlab.wikimedia.org/repos/ci-tools/libup-config/-/merge_requests/73 (https://phabricator.wikimedia.org/T391012) (owner: volker-e)
[21:04:55] Release-Engineering-Team (Yak Shaving 🐃🪒), Scap (SpiderPig 🕸️): Integrate SpiderPig with the Deployment calendar - https://phabricator.wikimedia.org/T392507#10802536 (bd808) https://wikitech.wikimedia.org/w/index.php?title=Template:Deploy&diff=prev&oldid=2299595 dropped the `(deploy commands)` bacc link.
[21:10:59] Release-Engineering-Team (Yak Shaving 🐃🪒), Scap (SpiderPig 🕸️): Integrate SpiderPig with the Deployment calendar - https://phabricator.wikimedia.org/T392507#10802548 (bd808) Open→Resolved a:dancy This was a team effort, but I think @dancy did the harder parts on the SpiderPig side so let'...
[21:22:19] Scap (SpiderPig 🕸️): SpiderPig UI: Add titles of backport patches to UI - https://phabricator.wikimedia.org/T393128#10802589 (bd808) The detail screen for each backport job is currently showing the patch title as well as an expandable panel with the full commit message: {F59753813, size=full}
[21:24:22] Scap (SpiderPig 🕸️): SpiderPig UI: Add titles of backport patches to UI - https://phabricator.wikimedia.org/T393128#10802591 (bd808) >>! In T393128#10802203, @jeena wrote: > I was thinking like under the job history section. Those rows in the Job History section currently look like: {F59753881, size=full}
[21:29:27] Scap (SpiderPig 🕸️), Patch-For-Review: spiderpig: Link URLs in prompts - https://phabricator.wikimedia.org/T392795#10802602 (bd808) Open→In progress a:dancy
[21:30:10] Scap (SpiderPig 🕸️), Patch-For-Review: spiderpig: Link URLs in prompts - https://phabricator.wikimedia.org/T392795#10802608 (bd808) p:Triage→Medium
[21:30:39] Release-Engineering-Team (Yak Shaving 🐃🪒), Scap (SpiderPig 🕸️): SpiderPig should support train deployments - https://phabricator.wikimedia.org/T392610#10802611 (bd808) Open→In progress
[21:31:53] (merge) thcipriani: spiderpig: api.py: Disable openapi URL [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/792 (owner: dancy)
[21:52:43] Scap (SpiderPig 🕸️): Accept unconsumed TOTP tokens from the T-1, T+0, and T+1 windows - https://phabricator.wikimedia.org/T393651 (bd808) NEW
[21:52:55] (update) bd808: spiderpig-otp: Add --qr flag to generate QR code [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/784 (owner: dancy)
[22:25:50] Beta-Cluster-Infrastructure: Beta cluster IP block page should not point to noc@wikimedia.org - https://phabricator.wikimedia.org/T393404#10802746 (bd808) This bit in [[https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/004fb996b8f8798ca6de266d91450969d7999e76/modules/varnish/templates/wikime...
[22:49:21] Beta-Cluster-Infrastructure: Beta cluster IP block page should not point to noc@wikimedia.org - https://phabricator.wikimedia.org/T393404#10802782 (bd808) There are three more 4xx synthetic responses in the same area of the Varnish config. They are all related to user-agent blocks of various kinds and all se...
[22:49:58] Beta-Cluster-Infrastructure: Add allowlist to make poking holes in abuse_networks:blocked_nets:networks easier - https://phabricator.wikimedia.org/T393481#10802783 (bd808) The simplest thing to do here might just be adding a new `allowed_nets` acl and some boolean logic like: `lang=vcl if (std.ip(req.http.X-...
[23:14:18] Phabricator, Wikimedia-Hackathon-2025: Deploy user style to reduce bot comments on Phabricator - https://phabricator.wikimedia.org/T393289#10802837 (bd808) >>! In T393289#10793098, @matmarex wrote: > We also talked a bit about introducing a new type of transaction that Gerritbot could use, or a new "para...