[00:31:37] Traffic, Abstract Wikipedia team: Server timeout for some pages on Wikifunctions - https://phabricator.wikimedia.org/T374657#10142973 (Bugreporter) →Duplicate dup:T374305
[06:36:18] Traffic, conftool, Epic: Extract an api class for requestctl - https://phabricator.wikimedia.org/T373449#10143199 (Joe) In progress→Resolved
[08:01:51] Traffic, Data Products, Data-Engineering, Patch-For-Review: Prepare puppet configuration to send haproxy logs to haproxykafka socket - https://phabricator.wikimedia.org/T374473#10143322 (gmodena) @fabfur thanks for the heads up. We'll need to coordinate a bit on roll out, because the previous ing...
[09:08:48] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10143412 (cmooney) Open→Resolved a:cmooney
[09:17:48] Wikimedia-Apache-configuration, MW-Interfaces-Team, Regression: After introduction of /api/ in ATS, https://en.wikipedia.org/api/ returns 404 Not Found - https://phabricator.wikimedia.org/T373998#10143466 (akosiaris) >>! In T373998#10141234, @daniel wrote: > @akosiaris Did I understand correctly...
[09:42:31] Traffic: Enable prometheus metrics scraping for haproxykafka - https://phabricator.wikimedia.org/T374696 (Fabfur) NEW
[09:49:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp2037:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2037 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:50:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp2029:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2029 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:54:00] FIRING: [3x] PurgedHighBacklogQueue: Large backlog queue for purged on cp2029:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:54:59] hmmm
[09:55:00] FIRING: [3x] PurgedHighEventLag: High event process lag with purged on cp2029:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:55:21] jayme ^^
[09:55:31] wut
[09:55:56] it seems like codfw has 3 days of events :_)
[09:56:24] even if those nodes consumed the data on eqiad, sigh
[09:56:32] let me stop puppet on those nodes ASAP
[09:57:16] you mean it interprets the events as new even though they have already been consumed from kafka in eqiad?
[09:57:21] yes
[09:57:23] ouch
[09:58:59] I don't know enough to really speak to it I guess...
[09:59:00] FIRING: [4x] PurgedHighBacklogQueue: Large backlog queue for purged on cp2029:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[10:00:00] FIRING: [10x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:01:43] jayme: any idea on how to mitigate this?
[10:03:37] vgutierrez: not really, _joe_ maybe?
[10:04:00] FIRING: [6x] PurgedHighBacklogQueue: Large backlog queue for purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[10:04:34] <_joe_> I guess you could move the client offset forward
[10:05:00] FIRING: [12x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:05:05] <_joe_> vgutierrez: is it increasing?
[10:05:25] I disabled puppet and I'm reverting at the moment
[10:05:40] <_joe_> it won't solve anything if they already started consuming from codfw
[10:05:43] <_joe_> I fear
[10:05:45] <_joe_> uhm
[10:05:52] ugh
[10:06:01] <_joe_> vgutierrez: actually no, go on and revert for now
[10:07:03] yup.. event lag went back to 0
[10:07:33] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10143593 (cmooney) Just a note to say myself and Jeff did some work on this yesterday in codfw and we now have B...
[10:07:53] so we need a way of skipping that chunk of events, or I need to perform the switch in a staged way
[10:08:50] purged is too aggressive to let it handle 3 days of events
[10:09:00] FIRING: [12x] PurgedHighBacklogQueue: Large backlog queue for purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[10:10:00] RESOLVED: [12x] PurgedHighEventLag: High event process lag with purged on cp2031:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:10:33] given that eqsin wasn't on codfw before I can move eqsin without any issues
[10:13:50] netops, Infrastructure-Foundations, SRE: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#10143598 (cmooney) Talking about this again I'm ok with the revised plan, with allocations similar to our POP sites. So for instance for codfw we can probably move ahead on this basis: * 2a02:ec8...
[10:14:00] RESOLVED: [14x] PurgedHighBacklogQueue: Large backlog queue for purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[10:16:43] vgutierrez: so you think purged will actually re-purge all the pages if we just let it run? Or does it somehow sanity check and just has to chew through all the events?
[10:17:31] jayme: it will skip every event older than 24h
[10:17:45] ah
[10:17:54] https://www.irccloud.com/pastebin/ot7xgHgf/
[10:17:55] could we let it run on one cp host then?
[10:18:06] jayme: we have one group per cp host
[10:18:12] damn
[10:18:16] see the example for cp2037
[10:19:31] why did it work when switching from codfw to eqiad then?
[10:19:40] new groups in eqiad
[10:19:48] ah
[10:20:10] but the codfw hosts were already aware of groups for cp hosts in codfw and ulsfo
[10:21:16] in theory it should be as easy as resetting offsets (--to-current)
[10:23:45] sounds promising at least
[10:24:17] hmm current or log-end?
[10:25:13] log-end apparently
[10:27:25] that's something we could try for one cp host then, right?
[10:28:23] yep
[10:28:38] I'll get the CR ready for cp2037
[10:29:13] stupid question, offset is cluster bound or topic bound?
[10:29:31] so same topics in the two clusters have the same log offset?
[10:29:33] Traffic: haproxykafka features - https://phabricator.wikimedia.org/T374128#10143657 (Fabfur)
[10:31:17] I would *assume* it's in sync because mirror maker makes it so - but I'm really out of my depth here. My kafka knowledge is super limited unfortunately
[10:33:02] but wouldn't you set the offset to the latest one anyways and then switch purged to codfw? Worst that might happen is that it processes some events twice (delta between setting offset to latest and restarting purged?)?
[10:33:47] yep
[10:34:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072720
[10:37:48] it should be as easy as running vgutierrez@kafka-main2006:~$ kafka-consumer-groups --bootstrap-server localhost:9092 --group cp2037 --reset-offsets --to-latest before merging
[10:37:57] jayme: looks good to you?
[10:38:42] vgutierrez: yah
[10:38:49] *yeah
[10:39:17] actually --execute is needed :)
[10:39:40] and I need --all-topics
[10:40:18] so kafka-consumer-groups --bootstrap-server localhost:9092 --group cp2037 --reset-offsets --to-latest --all-topics --execute
[10:40:45] ok.. that worked, we got a sane lag
[10:41:01] cool
[10:41:34] switching cp2037 now :)
[10:44:28] <_joe_> we should probably make this a cookbook or at least document it on the purged page
[10:46:54] yup
[10:47:10] i'll take care of switching and documenting this
[10:50:31] thanks valentin!
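Note on the offset reset above: a minimal Python sketch of the command that ended up working at 10:40:18, built as an argument list so it can be reviewed before anything is run. The helper name build_reset_command and its dry-run default are illustrative assumptions, not part of purged or of any existing cookbook. The flags are the ones quoted in the log, plus the --dry-run flag of kafka-consumer-groups.

```python
from typing import List


def build_reset_command(group: str,
                        bootstrap_server: str = "localhost:9092",
                        execute: bool = False) -> List[str]:
    """Build the kafka-consumer-groups call that moves a group to the log end.

    Hypothetical helper, for illustration only. Without --execute the tool
    only reports the offsets it would set, so dry-run is the default here.
    """
    if not group:
        raise ValueError("consumer group must not be empty")
    cmd = [
        "kafka-consumer-groups",
        "--bootstrap-server", bootstrap_server,
        "--group", group,
        "--reset-offsets",
        "--to-latest",    # jump to the log end, skipping the backlog
        "--all-topics",   # a reset needs a topic scope; the log used all topics
    ]
    cmd.append("--execute" if execute else "--dry-run")
    return cmd


if __name__ == "__main__":
    # One consumer group per cp host, as noted at 10:18:06.
    print(" ".join(build_reset_command("cp2037")))
    print(" ".join(build_reset_command("cp2037", execute=True)))
```

The second printed line matches the command from 10:40:18. As far as I know, Kafka only accepts a reset while the group has no active members on that cluster, which fits the order used here: reset on codfw first, then switch the host.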
[10:59:42] hmm actually https://doc.wikimedia.org/spicerack/master/api/spicerack.kafka.html#spicerack.kafka.Kafka.transfer_consumer_position
[12:41:26] DRY-RUN: END (PASS) - Cookbook sre.cdn.transfer-purged-positions (exit_code=0) rolling custom on P{cp203[4-6]*} and A:cp
[12:41:40] test-cookbook seems to be happy
[12:42:20] could I get some reviews on https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1072736 :?
[12:43:03] extended log of the test run available on cumin1002:~vgutierrez/cookbooks_testing/logs/sre/cdn/transfer-purged-positions-extended.log
[12:44:05] 2024-09-13 12:40:54,781 DRY-RUN vgutierrez 236377 [INFO] Extracted timestamps from source cluster "main", site "eqiad" and consumer group "cp2035".
[12:44:12] 2024-09-13 12:40:56,661 DRY-RUN vgutierrez 236377 [INFO] Offsets approximated and set for target cluster "main", site "codfw" and consumer group "cp2035".
[12:44:15] netops, Infrastructure-Foundations, SRE: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712 (aborrero) NEW
[12:47:05] fabfur: can you take a look given that it's inspired by one of your cookbooks?
[12:47:47] inspired by one of my cookbooks? that sounds awful
[12:47:52] netops, cloud-services-team, Infrastructure-Foundations, SRE: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713 (aborrero) NEW
[12:48:21] netops, cloud-services-team, Infrastructure-Foundations, SRE: openstack: verify security groups settings for IPv6 - https://phabricator.wikimedia.org/T374714 (aborrero) NEW
[12:49:47] netops, cloud-services-team, Infrastructure-Foundations, SRE: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715 (aborrero) NEW
[12:50:25] netops, cloud-services-team, Infrastructure-Foundations, SRE: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10144206 (aborrero)
[12:50:28] netops, cloud-services-team, Infrastructure-Foundations, SRE: openstack: verify security groups settings for IPv6 - https://phabricator.wikimedia.org/T374714#10144207 (aborrero)
[12:50:37] netops, cloud-services-team, Infrastructure-Foundations, SRE: CloudVPS: IPv6 early PoC - https://phabricator.wikimedia.org/T245495#10144208 (aborrero)
[12:55:01] netops, cloud-services-team, Infrastructure-Foundations, SRE: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144214 (aborrero)
[12:55:29] netops, cloud-services-team, Infrastructure-Foundations, SRE: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144223 (aborrero)
[12:56:05] netops, cloud-services-team, Infrastructure-Foundations, SRE: cloudgw: add support and enable IPv6 - https://phabricator.wikimedia.org/T374716 (aborrero) NEW
[12:56:19] netops, cloud-services-team, Infrastructure-Foundations, SRE: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10144237 (aborrero)
[12:56:22] netops, cloud-services-team, Infrastructure-Foundations, SRE: cloudgw: add support and enable IPv6 - https://phabricator.wikimedia.org/T374716#10144238 (aborrero)
[12:56:28] netops, cloud-services-team, Infrastructure-Foundations, SRE: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144239 (aborrero)
[12:57:11] netops, cloud-services-team, Infrastructure-Foundations, SRE: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144245 (aborrero)
[12:57:33] netops, cloud-services-team, Infrastructure-Foundations, SRE: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144241 (aborrero)
[12:57:35] netops, cloud-services-team, Infrastructure-Foundations, SRE: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144246 (aborrero)
[13:02:53] fabfur: :|
[14:18:53] if someone has a chance to look at this patch it would be appreciated, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072566
[14:19:49] jhathaway: will do shortly
[14:20:04] much appreciated!
[15:29:00] Traffic, conftool, Epic: Coexistence of the requestctl CLI tool and of the web interface - https://phabricator.wikimedia.org/T374723 (Joe) NEW
[15:39:37] Traffic, conftool, Epic: Coexistence of the requestctl CLI tool and of the web interface - https://phabricator.wikimedia.org/T374723#10144661 (Joe) I would personally much prefer the second option, and I'd rather focus on keeping a good audit log one can find on logstash rather than working to keep t...
[15:41:45] Traffic, conftool, Epic: Create simple web view of requestctl status - https://phabricator.wikimedia.org/T371782#10144670 (Joe) >>! In T371782#10114735, @TBurmeister wrote: > Hi @Joe, do you have a screenshot or any mockups of what the web UI for this tool will look like, or a list of all the functio...
[18:34:39] win 14
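Note on the 12:44 dry-run lines: they describe reading timestamps on the source cluster and approximating offsets on the target, which sidesteps the 10:29 question of whether offsets match across clusters (mirrored topics are not guaranteed to share offset numbers). A minimal Python sketch of that idea follows, assuming the target partition can be viewed as (offset, timestamp_ms) pairs sorted by timestamp. The name approximate_offset is hypothetical and this is not Spicerack's actual implementation.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def approximate_offset(target_log: List[Tuple[int, int]],
                       source_timestamp_ms: int) -> Optional[int]:
    """Return the first target offset whose timestamp is >= the source one.

    target_log holds (offset, timestamp_ms) pairs sorted by timestamp.
    Returns None when every record is older than the source timestamp,
    meaning the consumer should simply start at the log end.
    """
    timestamps = [ts for _, ts in target_log]
    idx = bisect_left(timestamps, source_timestamp_ms)
    if idx == len(target_log):
        return None
    return target_log[idx][0]


if __name__ == "__main__":
    # Illustrative data only: three records in the target partition.
    log = [(100, 1_000), (101, 2_000), (102, 3_000)]
    print(approximate_offset(log, 1_500))
```

Picking the first record at or after the timestamp errs on the side of reprocessing nothing older than the source position. Kafka's consumer API offers a lookup with the same semantics (offsetsForTimes), so a real tool would ask the broker rather than scan a list.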