[06:19:21] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11792333 (ABran-WMF) >>! In T421827#11789099, @hashar wrote: > It is not an issue with the software (git) bein...
[07:49:38] Traffic, Data-Services, Datasets-General-or-Unknown, tools-infrastructure-team: Migrate clouddumps https/rsync interfaces behind LVS - https://phabricator.wikimedia.org/T422040#11792514 (taavi) p:Triage→Medium
[08:26:35] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11792610 (JAllemandou) >>! In T422030#11789124, @Vgutierrez wrote: An interesting change in behavior from 3.0 to 3.2 and that could be related is that after the u...
[08:32:49] Traffic, Data-Services, Datasets-General-or-Unknown, tools-infrastructure-team: Migrate clouddumps https/rsync interfaces behind LVS - https://phabricator.wikimedia.org/T422040#11792643 (taavi) a:taavi Per Traffic this should be a high-traffic2 service. I have allocated a VIP, namely ` dumps-...
[08:43:42] who should I ask to review some patches to add a new LVS service?
[08:48:44] Traffic, Infrastructure-Foundations: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot - https://phabricator.wikimedia.org/T422261#11792759 (Volans) IMHO the solution here is to create a dedicated cookbook for the LVS that has all the logic needed for LVS reboots (reboot first t...
[08:54:45] Traffic, Patch-For-Review: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792#11792778 (Ladsgroup) I have tightened the rate limit to 2 per server per minute (with block time of 60 seconds) from 10 per server per minute (with block time of 10 seconds) since th...
[09:09:57] taavi: any of us really. I'm at the hospital and fab.fur is ooo till this afternoon. We can review them later today.
[09:22:09] Traffic, SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11792871 (hnowlan) Open→Resolved a:hnowlan Glad that it's sorted out!
[09:29:15] thanks, not that urgent, take care. I'll add you both
[09:34:56] netops, Traffic, DNS, Infrastructure-Foundations, and 2 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11792933 (ayounsi) What would be a good day to alert about those ? Or even better, not even need an alert ?
[10:25:28] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793088 (gmodena) Hi, Thanks for reaching out. Roughly speaking, we start to throttle connections (for bots that respect maxlag) when the change propagation lag betw...
[11:39:47] hello, could I get a review on this "just in case" change ? https://gerrit.wikimedia.org/r/c/operations/dns/+/1268538
[12:00:04] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793405 (Ladsgroup) FWIW, if the maxlag is consistently high but some bots are still editing so fast that are keeping wdqs under pressure, it is a clear violation of...
[12:01:33] netops, Infrastructure-Foundations, Patch-For-Review: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11793409 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5c0f0433-7743-49c9-b411-8f120a9f337d) set by ayounsi@cumin1003 for 1:00:00 on 3 host(s)...
[12:15:48] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793455 (Ladsgroup) The top editor yesterday and the day before was Mahir256 with 40K edits each day. The day before that was @Epidosis with 203K edits(!), the day b...
[12:20:29] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793465 (Mahir256) @Ladsgroup both Epìdosis and I were using QuickStatements (he version 3.0 and I version 2.0); your complaint about tools not respecting maxlag shou...
[12:22:27] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793473 (magnusmanske) Oops, I'll fix it in V2
[12:23:43] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793480 (Ladsgroup) Thanks!
[12:45:05] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11793614 (ABran-WMF) >>! In T421827#11785472, @A_smart_kitten wrote: >>>! In T421827#11783908, @SomeRandomDeve...
[12:49:46] taavi: also happy to help. just ping us here I guess when you are ready and one of us will take the reviews
[12:52:25] FIRING: [3x] SystemdUnitCrashLoop: purged.service crashloop on cp3074:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:52:29] FIRING: SystemdUnitCrashLoop: purged.service crashloop on cp3067:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:55:40] hmmm we broke purged?
[12:56:25] seems like it failed trying to connect to kafka
[12:56:28] yep
[12:56:33] nothing too bad
[12:56:44] it's happy now
[12:57:25] FIRING: [8x] SystemdUnitCrashLoop: purged.service crashloop on cp3066:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:57:25] FIRING: [8x] SystemdUnitCrashLoop: purged.service crashloop on cp3074:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:58:30] sukhe: thank you, I have attached patches to T422040
[12:58:30] T422040: Migrate clouddumps https/rsync interfaces behind LVS - https://phabricator.wikimedia.org/T422040
[13:00:11] taavi: ok thanks. we will talk about it in our meeting today and assign reviewers
[13:02:25] RESOLVED: [8x] SystemdUnitCrashLoop: purged.service crashloop on cp3074:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:02:25] RESOLVED: [8x] SystemdUnitCrashLoop: purged.service crashloop on cp3066:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:03:07] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793667 (Epidosis) Hi, this may be related to my import of data from GND into Wikidata via QS 3.0 which ran from April 1 to April 5 (https://w.wiki/KdP6). I thought Q...
[13:04:41] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793682 (magnusmanske) V2 should be fixed now
[13:19:29] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793758 (Xqt) For the record: The problems began on March 25th or 26th (see Grafana control panel), and it is still an issue currently because the minimum maxlag is 9...
[13:28:47] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11793787 (ABran-WMF) I forgot to add on my previous comment: ` # for i in $(fdfind --type f --change-newer-tha...
[13:30:45] netops, Infrastructure-Foundations, Patch-For-Review: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11793801 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3c9a80b3-71a3-420a-bae4-d8cf79e5188e) set by ayounsi@cumin1003 for 0:30:00 on 3 host(s)...
[13:53:20] netops, Infrastructure-Foundations, Patch-For-Review: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11793947 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b81e6903-f6db-4d47-bfbb-a5bff02b24fa) set by ayounsi@cumin1003 for 2:00:00 on 3 host(s)...
[14:31:36] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11794192 (RobH) Mainboard swap will occur on Wednesday, April 8th @ 10:00Singapore time which is Tuesday, Tuesday April 7th 18:00 Pacific. I'll be online for the duration of the work and to ensure...
[14:32:48] netops, Infrastructure-Foundations, SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11794201 (LSobanski) p:Triage→Low
[14:32:50] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11794213 (ayounsi) p:Triage→Medium
[14:33:54] Traffic, Infrastructure-Foundations: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot - https://phabricator.wikimedia.org/T422261#11794216 (LSobanski) p:Triage→Low
[14:57:40] netops, Infrastructure-Foundations, Patch-For-Review: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11794425 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=918af1bd-6f87-4ec9-acb7-39622f38db7c) set by ayounsi@cumin1003 for 0:30:00 on 3 host(s)...
[15:55:51] netops, Infrastructure-Foundations, SRE: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525 (cmooney) NEW p:Triage→Medium
[15:56:01] netops, Infrastructure-Foundations, SRE: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11794809 (cmooney)
[15:56:07] netops, Infrastructure-Foundations, Patch-For-Review: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11794810 (cmooney)
[16:12:47] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11794907 (Vgutierrez) It looks like the root cause is [[ https://github.com/haproxy/haproxy/commit/0b7a5a64eb51ce4b22866064caa1c8e08ee17b8c | MEDIUM: log/session:...
[16:47:16] netops, Infrastructure-Foundations, SRE: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11795258 (cmooney)
[16:49:27] netops, Infrastructure-Foundations, SRE: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11795284 (cmooney)
[18:09:19] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11795832 (BCornwall)