[09:21:11] hello, I've got an unexpected result on a hiera lookup via a cookbook: https://phabricator.wikimedia.org/P79557 on both lookups I should have 2 different users, but my cookbook seems to only find a single user
[10:03:25] looks to me there is only one string in for that var anywhere
[10:03:31] however it is defined two places in the repo:
[10:03:42] cmooney@wikilap:~/repos/puppet$ egrep -R ^profile::gerrit::daemon_user *
[10:03:42] hieradata/common/profile/gerrit.yaml:profile::gerrit::daemon_user: 'gerrit2'
[10:03:42] hieradata/hosts/gerrit2003.yaml:profile::gerrit::daemon_user: 'gerrit'
[10:04:22] the common / default is 'gerrit2', but there is a specific override for host gerrit2003, so if it's evaluated for that host it will return just 'gerrit'
[10:05:32] actually I'm not sure that makes sense, you get "gerrit" returned for both hmmm
[10:08:49] I got removal of puppetserver2003 from homer cr*-eqiad* in the firewall stanza, that good to go?
[10:09:35] ok found the decom SAL, committing
[10:12:55] claime: thanks yep seems safe, state is "decom" in netbox so homer will remove it
[10:13:10] yeah all good, thanks for double checking
[10:44:11] Hi, sorry to trouble again, but we're no further forward with T400037 at the moment. I was going to go ahead and reserve a `/18` container in here: https://netbox.wikimedia.org/ipam/prefixes/379/prefixes/ (as per discussion) but now that I look I think it is already fully allocated. Would `10.192.128.0/18` be OK to use?
[10:44:11] T400037: Determine dse-k8s-codfw Kubernetes IP ranges - https://phabricator.wikimedia.org/T400037
[10:51:24] btullis: apologies I somehow got mixed up and fed back on another task altogether
[10:51:48] \o/ ack, thanks.
[10:51:51] let me dig in, 10.192.128.0/18 is 25% of the private space for codfw in total, so I'm not 100% sure we want to go for that off the bat
[10:52:23] OK.
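Editor's note: the sizing question above can be checked with a little prefix arithmetic. The sketch below is only an illustration, not something run in the channel. It assumes the enclosing /16 and /12 can be derived from the proposed prefix itself (they are not read from Netbox), and the helper name `share` is hypothetical. It compares the proposed /18 against a /16 and against the /12 site allocation mentioned in the follow-up message.

```python
import ipaddress

# Prefix proposed in the chat for the new container.
proposed = ipaddress.ip_network("10.192.128.0/18")

# Assumption: the enclosing blocks are derived from the proposed prefix,
# not looked up in Netbox.
enclosing_16 = proposed.supernet(new_prefix=16)
site_12 = proposed.supernet(new_prefix=12)


def share(child, parent):
    """Return the fraction of parent's address space covered by child."""
    return child.num_addresses / parent.num_addresses


print(enclosing_16, f"{share(proposed, enclosing_16):.2%}")
print(site_12, f"{share(proposed, site_12):.4%}")
```

A /18 is a quarter of a /16 but only 1/64 of a /12, which is why the correction about the site allocation size makes the request easier to accommodate.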
[10:53:10] sorry the site allocation is a /12, that makes it easier
[11:00:17] btullis: do you think the /18 is going to be needed for future growth?
[11:01:17] if we allocate a /18 to cover a /20 and /21 the majority of the space is unused. that would make sense if we want to keep adjacent space for future growth
[11:10:28] topranks: No, I think we would be fine with a smaller container than a `/18`- we don't have any current plans for any child prefixes other than the `/20` and `/21`.
[11:13:06] btullis: ok great
[11:13:15] I replied on task with what I think are sensible allocations we can make
[11:13:22] we will also need to allocate an ASN I think
[11:20:07] Oh yes, I see. We have an ASN for the corresponding cluster in eqiad. https://netbox.wikimedia.org/ipam/asns/55/ - Do I just create the next in the sequence in codfw?
[12:14:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:19:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:27:18] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Discrepencies with cableid & ports on some msw in c/d <-> msw1-eqiad - https://phabricator.wikimedia.org/T400159 (Jclark-ctr) NEW
[13:46:32] urgh it was a typo -_-
[14:12:54] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself - https://phabricator.wikimedia.org/T400161 (cmooney) NEW p:Triage→Low
[14:53:46] dear foundations, I am trying to create this mtail black magic https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171562 and I am getting a CI failure
[14:53:46] https://integration.wikimedia.org/ci/job/operations-puppet-tests-bullseye/14400/console
[14:54:20] and I can't make heads or tails of it. I have tried the mtail script and I know that it works, and, to my understanding, the test works too
[15:15:16] jhathaway: o/ I have a follow up question for the nvme use case if you have time, but please don't spend too much time on it, just lemme know if you have ideas :) In https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167873/4/modules/install_server/files/autoinstall/partman/hwraid-1dev-nvme.cfg I added the recipe for the 1dev use case, since sretest2006 is a dell with a boss card and two nvmes in raid1. Same issue, no efi
[15:15:16] partition found
[15:15:42] earlier on I tried a stupid fix and reverted, I was over excited by your patch :D
[15:15:55] but thinking a bit more, I don't understand why the 1dev recipe fails
[15:16:14] maybe we need to add extra config to instruct d-i where the efi partition is?
[15:16:37] how does the boss card show up? /dev/boss?
[15:19:17] jhathaway: afaics from d-i /dev/nvme et..
[15:19:28] like it was a single nvme device
[15:20:05] effie: o/ the only thing that I wonder is where "KeyError: key not found: "PARALLEL_PID_FILE" comes from, since it seems a Python exception.. maybe from the tests that you added?
[15:20:51] I understand it is some CI thing going bad
[15:21:59] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11024611 (Jhancock.wm) I'll trash the optic. Good to close if there are no other points to cover.
[15:24:41] effie: and if you recheck it repro consistently right?
[15:25:31] ie bump it and let CI run again ?
[15:25:42] actually no :)
[15:29:28] grand, it is not producing the same error
[15:37:21] I may have found it
[15:37:25] stupid flake
[15:38:00] goood
[15:38:14] I ran flake locally and spilled some more secrets
[15:38:36] but def now something found in the output on CI
[15:38:48] tx luca!
[15:45:45] elukey: nothing is jumping out at me with regard to sretest2006
[15:45:55] the partitions are not being created
[15:46:15] there should be three partitions, but only one large partition is being created, which is strange
[15:48:12] nothing obvious in the installer logs, though they are not the easiest to read
[15:51:23] jhathaway: okok perfect, I'll keep digging, it is not urgent just a new test server!
[15:51:26] thanks a lot!
[15:52:29] of course, let me know if you want me to dig further, debugging partman is always difficult
[15:58:44] I am back on sq 1 https://integration.wikimedia.org/ci/job/operations-puppet-tests-bullseye/14411/console. It is something silly I can't put my finger on
[16:03:18] hello I/F friends - following up here after discussion with elukey:
[16:03:18] * as part of the PHP 8.3 upgrade, we'd like to switch from the unmaintained tideways extension [0] (being removed from debian, not present in trixie) to xhprof [1] (recommended alternative; not yet included in debian, but available on sury.org)
[16:03:18] * both derive from the same codebase historically, but have diverged over time. I've reviewed the package source code and have no concerns.
[16:03:18] * however, while m.oritzm is out, it's unclear to me if there's a more concrete process for vetting new packages for use at WMF. if anyone has input, that would be greatly appreciated.
[16:03:18] [0] https://salsa.debian.org/php-team/pecl/tideways
[16:03:18] [1] https://salsa.debian.org/php-team/pecl/php-xhprof
[16:03:40] more detail on this saga is available in https://phabricator.wikimedia.org/T398245 :)
[17:00:01] Mail, collaboration-services, Infrastructure-Foundations, SRE, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#11025130 (jhathaway) >>! In T378028#10995390, @Arnoldokoth wrote: > Thanks @MoritzMuehlenhoff We'll consider that... But I'm doubtful...
[17:04:38] swfrench-wmf: thanks for the detailed information, I am not aware of a more detailed vetting process, I think your approach is fine. Do you know if there are any plans to have xhprof in debian proper?
[17:06:29] also as a selfish aside, any chance my mail patches could be part of the package? https://phabricator.wikimedia.org/T360995#10942948
[17:14:37] jhathaway: thank you very much! I don't know off hand what the plan is for php-xhprof ending up in debian, but I can check around a bit.
[17:14:37] oh, and yes - I've backported your mail fixes to 8.3 and hope to do the same for 8.1 at some point soon.
[17:30:05] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11025252 (cmooney) Open→Resolved
[18:53:53] netops, Infrastructure-Foundations, SRE: Inaccurate stats showing for some gnmic sourced metrics in codfw - https://phabricator.wikimedia.org/T400205 (cmooney) NEW p:Triage→Medium
[18:57:55] netops, Infrastructure-Foundations, SRE: Inaccurate stats showing for some gnmic sourced metrics in codfw - https://phabricator.wikimedia.org/T400205#11025679 (cmooney) Hmm so looking a bit closer the issue seems to be counters on cr2-codfw itself ` cmooney@re0.cr2-codfw> show interfaces xe-0/1/1:1 |...
[18:58:12] netops, Infrastructure-Foundations, SRE: Inaccurate stats showing for cr2-codfw stats in codfw - https://phabricator.wikimedia.org/T400205#11025680 (cmooney)
[18:58:42] netops, Infrastructure-Foundations, SRE: Inaccurate stats showing for cr2-codfw stats in codfw - https://phabricator.wikimedia.org/T400205#11025681 (cmooney)
[19:01:45] netops, Infrastructure-Foundations, SRE: Inaccurate stats showing for cr2-codfw stats in codfw - https://phabricator.wikimedia.org/T400205#11025685 (cmooney) >>! In T400205#11025679, @cmooney wrote: > Perhaps some odd bug to do with the new MPC10E card? This possibly? https://supportportal.juniper....
[19:08:39] netops, Infrastructure-Foundations, SRE: Inaccurate stats repoted by cr2-codfw - https://phabricator.wikimedia.org/T400205#11025713 (cmooney)
[19:09:34] netops, Infrastructure-Foundations, SRE: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205#11025714 (cmooney)
[19:21:56] netops, Infrastructure-Foundations, SRE: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205#11025744 (cmooney) The linked PR on the Juniper site says it was fixed in 23.4R1, we are on 23.4R2, so in theory shouldn't be it. I guess we could try the same fix, probably th...
[22:04:11] Mail, Infrastructure Security, Infrastructure-Foundations, SRE: Add DMarcian trial-account address to the dmarc-ruf@wikimedia.org postfix mailing list - https://phabricator.wikimedia.org/T396062#11026117 (jhathaway) Open→Resolved a:jhathaway ysuu9wx7@ag.us.dmarcian.com has been ad...
[22:10:53] Mail, Fundraising-Backlog, fundraising-tech-ops, Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#11026153 (jhathaway) >>! In T394788#11017745, @nisrael wrote: > Hi SRE team, > > Checking in on this task. Do you have an approximate...