[12:56:55] Stupid Puppet Question - why is a puppet7 CI saying "Error: Evaluation Error: Unknown function: 'index'.", when it's listed at https://www.puppet.com/docs/puppet/7/function.html#index ? [and, relatedly, am I allowed to use puppet7-only features?]; alternatively, what should I be doing instead?
[12:57:45] I guess, also, what is "fail fast" and how do I find out why a host is doing that?
[12:59:22] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1175874 is the change in question
[13:01:48] <_joe_> Emperor: looks like you're using an illegal syntax for index
[13:02:23] <_joe_> specifically it's a function like .each that expects an execution block
[13:02:33] <_joe_> what were you trying to do there?
[13:03:03] the puppet docs have an example 'notice $data.index('servers') # notices 1' which I thought was analogous to what I wanted
[13:04:13] _joe_: $serials is an array of strings, one of which should match the first regex matching group; I want the index of the match
[13:05:36] i.e. $1 should be one of the members of the array $serials, and I want the index number of that member
[13:07:26] <_joe_> so
[13:07:34] <_joe_> puppet apply -e '$a = ["a", "b", "d", "c"]; $b= $a.index("d"); notice($b)' works
[13:07:42] <_joe_> on a ms-be host
[13:07:52] <_joe_> I wanted to verify the problem wasn't puppet itself
[13:16:31] <_joe_> Emperor: something else must be wrong, because I just ran
[13:16:34] <_joe_> puppet apply -e '$jbod_re = /exp0x([0-9a-z]+)-phy(\d+)-lun/; $serials = ["213", "214", "12a"]; $drive = "exp0x12a-phy214-lun"; if ( $drive =~ $jbod_re ) { $ser_num = $serials.index("${1}") }; notice($ser_num)'
[13:16:52] <_joe_> on a production host, and that returns "2" as you'd expect
[13:18:21] :(
[13:18:51] I added some more debug-by-err and it looks reasonable (until the confusing error about index not existing) - https://puppet-compiler.wmflabs.org/output/1175874/7160/ms-be1091.eqiad.wmnet/change.ms-be1091.eqiad.wmnet.err
[13:19:05] <_joe_> where is the puppet error?
[13:19:09] <_joe_> ah thanks
[13:19:25] <_joe_> Error: Scope(Class[Profile::Swift::Storage::Configure_disks]): 500304801ff9b73f
[13:19:33] <_joe_> at some point "serials" becomes a single string
[13:19:50] _joe_: no, that's the extra 'err $1' I added in the most recent change-set
[13:19:57] <_joe_> hah
[13:20:00] <_joe_> ok
[13:20:09] sorry, was trying to check that it was sensible (and present in $serials)
[13:22:45] <_joe_> did you try to use "${1}" already in the index function, right?
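[Editor's note: for reference, the pattern being debugged here - matching a drive name against a regex and looking up the captured serial in an array - can be reproduced standalone with puppet apply, in the same style as _joe_'s one-liners. This is a minimal sketch with illustrative values (the real serials and drive names come from the change under review); note that the index() built-in is not available in Puppet 5, which the later part of this log suggests is the actual culprit behind the failing PCC run.]

    # Illustrative values only; index() needs a new-enough Puppet agent (Puppet 7 here),
    # it is missing in Puppet 5.
    puppet apply -e '
      $jbod_re = /exp0x([0-9a-z]+)-phy(\d+)-lun/
      $serials = ["500304801ff9b73f", "500304801ffa4e3f"]
      $drive   = "exp0x500304801ffa4e3f-phy2311-lun"
      if $drive =~ $jbod_re {
        # inside the conditional, $1 holds the first capture group
        $ser_num = $serials.index($1)
        notice("serial ${1} is at index ${ser_num}")
      }'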
[13:24:08] yes, I did that previously, same problem, then remembered that you weren't meant to quote variables where not interpolating them into strings, so tried $1 instead
[13:24:47] <_joe_> puppet apply -e '$jbod_re = /exp0x([0-9a-z]+)-phy(\d+)-lun/; $serials = ["500304801ff9b73f", "500304801ffa4e3f"]; $drive = "exp0x500304801ff9b73f-phy2311-lun"; err $serials; if ( $drive =~ $jbod_re ) { $ser_num = $serials.index($1) }; notice($ser_num)' still works
[13:24:52] <_joe_> returns "2" as notice
[13:25:12] <_joe_> I'm running it on a host though, not trying to compile it
[13:25:48] <_joe_> but, I'd try to repro it in your dev env
[13:25:59] <_joe_> it's also possible something's broken in the puppet compiler
[13:28:22] Given my general inability to write puppet, I will be surprised if it's the compiler (and also sad)
[13:29:44] it's presumably not a variant of T366387 (which relates to the version of PuppetDb PCC is running)
[13:29:45] T366387: PCC throws evaluation error on valid code - https://phabricator.wikimedia.org/T366387
[13:40:20] a general inability to write puppet may be a sign of intelligence :)
[13:42:53] AFAICT we use .index exactly once in the codebase already in modules/profile/manifests/gnmi_telemetry.pp
[13:46:46] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1154303 passed on p7 not on p5 (as one might expect). Is there any harm to re-running PCC on that (merged) CR?
[13:47:17] <-- clutching at straws somewhat
[13:58:43] <_joe_> bblack: or of denial
[13:59:46] _joe_: FWIW, I re-ran the CI on that other change, and it now fails too with the same error: https://puppet-compiler.wmflabs.org/output/1154303/7162/netflow7002.magru.wmnet/change.netflow7002.magru.wmnet.err
[14:00:20] <_joe_> Emperor: so you're saying it's a I/F problem, gotcha
[14:00:20] <_joe_> :D
[14:01:11] looking back through that CR, PCC did run OK on patchset 14 on 10 June
[14:01:45] _joe_: um, well, that would support the "PCC is not working properly" rather than "Emperor cannot write puppet" in this particular case.
[14:02:18] <_joe_> that's what I suggested earlier, yes
[14:02:48] I'll open a ticket about that, but it leaves me in a slightly awkward position about what to do with my CR
[14:11:04] Lucas_WMDE we just finished a reload on wdqs1022 (ref T386098, T384344). Do you have access to verify the changes worked? If there's a query I can run to verify LMK. cc dcausse
[14:11:04] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[14:11:05] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344
[14:11:42] uhhhh
[14:12:27] inflatador: will run a query on the node to check
[14:12:46] wasn’t T384344 supposed to have no effect on WDQS?
[14:13:18] I think dcausse knows better how to test these servers 😅
[14:13:59] Lucas_WMDE: yes, will do a quick check, the 'some value' hashes have changed, will just check a couple to see if they match what's inside wikidata and that should be enough
[14:15:01] ok, sounds good
[14:15:09] Thanks Lucas_WMDE and dcausse ! If I can help LMK
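[Editor's note: a sketch of the kind of sanity check discussed here, run directly on the reloaded host (e.g. wdqs1022) to count total triples and compare against another wdqs host or the Grafana panel mentioned just below. The endpoint port and path are an assumption based on a typical local Blazegraph setup, so adjust them to the actual service configuration.]

    # Assumes the standard local Blazegraph SPARQL endpoint on a wdqs host;
    # port/path are an assumption, not confirmed in this discussion.
    curl -s -X POST http://localhost:9999/bigdata/namespace/wdq/sparql \
      -H 'Accept: application/sparql-results+json' \
      --data-urlencode 'query=SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }'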
[14:15:31] the other check I was thinking now was just number of triples, given that the reload was shorter than expected (https://phabricator.wikimedia.org/T386098#11060911)
[14:15:38] just to make sure it actually loaded a reasonable amount of data
[14:16:36] although I guess https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=panel-7 shows that already
[14:18:01] yeah, I was thinking the number of triples is close enough, but if that's not the case LMK
[14:18:19] * Emperor opens T401203
[14:18:34] (apropos the PCC 7 index failure)
[14:19:03] * Lucas_WMDE suspects stashbot doesn’t respond to task ID in /me messages
[14:19:07] hm... indeed, we don't delete orphaned triples so some difference is expected... but 20M seems like a lot...
[14:19:56] if some hashes changed, wouldn’t the “old” servers have both old and new hash nodes, at least for items that received an update?
[14:20:46] true but could that explain 20M triples, no clue..
[14:21:42] well... will run some checks and we'll find out :)
[14:23:04] 👍
[14:32:11] Emperor: seeing your messages now, how can I help?
[14:33:44] Hello. I'm getting a netbox/DNS related error when running some cookbooks. I wonder if anyone has any guidance, please.
[14:33:57] The final error message is: `FileNotFoundError: [Errno 2] No such file or directory: '/tmp/dns-check.ynnrpx65/zones/netbox/5.65.10.in-addr.arpa'`
[14:34:45] Tracking back, this was shown: `Exception: Command python3 ./utils/zone_validator.py --error --zones_dir /tmp/dns-check.ynnrpx65/zones --ignores MISSING_ASSET_TAG,MISSING_MGMT_FOR_NAME,TOO_FEW_MGMT_NAMES failed with exit code 1, stderr:`
[14:37:03] I saw it when running `sre.hosts.decommission` and now again when running `sre.hosts.rename`
[14:37:49] I am in a meeting
[14:38:10] so can't fix it but what is probably happening is that there are no IPs in
[14:38:13] $ORIGIN 5.65.@Z
[14:38:16] ; See https://wikitech.wikimedia.org/wiki/DNS/Netbox
[14:38:18] $INCLUDE netbox/5.65.10.in-addr.arpa
[14:38:59] so that include needs to be removed
[14:39:02] OK, thanks. Should I remove that $INCLUDE statement from the dns repo?
[14:39:13] Ack. Got it.
[14:39:30] yes please
[14:39:33] I still have to check it though
[14:39:39] topranks: ^ can you check in the meantime if around
[14:41:23] jhathaway: hi - I opened T401203 with a summary, but it _looks_ like PCC for puppet 7 is broken regarding the index function
[14:41:24] T401203: PCC 7 regression "Error: Evaluation Error: Unknown function: 'index'" - https://phabricator.wikimedia.org/T401203
[14:41:46] jhathaway: alternatively, please tell me what I did wrong in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1175874 :-D
[14:41:55] :)
[14:42:59] I replied on task, but it seems like only the Puppet5 PCC run is failing?
[14:43:05] topranks: 1175909: Remove the netbox/5.65.10.in-addr.arpa zone from list of includes | https://gerrit.wikimedia.org/r/c/operations/dns/+/1175909
[14:44:06] jhathaway: that's not what I see at https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/7162/console
[14:44:27] jhathaway: you look to have linked to an earlier puppet 7 run?
[14:44:37] ah, I may have, re-looking
[14:45:12] the one you just linked is puppet 5, https://puppet-compiler.wmflabs.org/output/1154303/7162/
[14:46:07] maybe I am just confused by the gerrit check outputs
[14:46:09] * Emperor looks again
[14:46:34] btullis: which host did you decommission?
[14:46:57] jhathaway: oh, duh, I am too stupid, I thought "experimental" was the p7 check, but actually p7 gets squirreled away under "info"
[14:47:03] sukhe: I decommed snapshot1010 and dumpsdata1003
[14:47:38] Emperor: you're definitely not dumb, and I would assume experimental is the p7 check as well, I'll talk to hashar about improving the naming
[14:47:40] but now that I check, neither of those was in 5.65.10.in-addr.arpa
[14:47:40] jhathaway: wait, why does it say 'operations-puppet-catalog-compiler-puppet7-test | Experimental build failed. | 48s ' ?
[14:48:03] In https://gerrit.wikimedia.org/r/c/operations/puppet/+/1154303?tab=checks
[14:48:12] checking
[14:48:24] but as you say, below that the pcc7 link goes to https://puppet-compiler.wmflabs.org/output/1154303/4648/ which looks like success
[14:49:10] that is strange, but that link shows "Finished: SUCCESS" at the bottom, perhaps the exit code is wrong
[14:49:25] should be fixed, not sure if I have seen that before
[14:49:57] meaning, I'll open a task to look into that
[14:50:26] jhathaway: thanks, I'll close T401203 as invalid, and sorry for the noise
[14:50:27] T401203: PCC 7 regression "Error: Evaluation Error: Unknown function: 'index'" - https://phabricator.wikimedia.org/T401203
[14:50:41] btullis: reviewing shortly
[14:50:42] sorry, meeting
[14:50:52] Emperor: not at all, clearly there are some issues
[14:50:58] Thanks. Not urgent.
[14:53:18] jhathaway: I've closed that task, but also uploaded a screenshot which I think shows why I got confused
[15:00:41] Emperor: thanks
[15:03:11] btullis: can you confirm
[15:03:25] an-conf1002, 1003 also being decommissioned?
[15:04:16] sukhe: Yes, m'colleague stevemunene decommed those in T398013
[15:04:17] T398013: decommission an-conf100[1-3] - https://phabricator.wikimedia.org/T398013
[15:04:20] ok
[15:04:28] just checking what IPs are being removed
[15:11:32] btullis: +1ed
[15:14:20] btullis: I am going to go ahead and merge it
[15:14:25] since this blocks other stuff
[15:14:41] Thanks, that's great.
[15:22:19] btullis: all done
[15:22:27] you can resume from where you were, should be fine
[15:22:29] please ping me if not
[15:22:36] (I have tried running both)
[15:24:46] That's great, thanks. Yep, `retry` in the cookbook seems to be working now.
[15:25:20] great. thanks for letting us know!
[15:36:47] github seems to be having some issues
[16:09:46] <_joe_> bblack: indeed I'm getting 500s pretty consistently
[16:10:16] https://www.githubstatus.com/ yeah looks down
[18:36:01] Hello, all! The Abstract Wikipedia team is dealing with some issues around resource limits, particularly 1) request size and 2) memory. I'm not able to see where those limits are configured. Is that something we're able to see? And is this something we can change, or could SRE help us with that?
[19:09:50] apine forgive my ignorance as I don't know anything about Abstract Wikipedia. Do you know if your application runs in the main wikikube cluster?
[19:11:07] I'm not sure--what is the main wikikube cluster called?
[19:11:53] apine I know it as wikikube, although it probably has other names. It looks like Wikifunctions run there (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/wikifunctions/), is that the application you're concerned about?
[19:12:27] More info on k8s clusters here: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters
[20:36:03] swfrench-wmf: I've added links to several tasks at https://wikitech.wikimedia.org/wiki/User:Krinkle/PHP_Upgrade_Process#Upgrade_from_PHP_8.1_to_PHP_8.3
[20:50:31] I should probably put it as markdown on the tracking task instead.
[20:50:44] Might do that tomorrow, more visible to others that way.
[20:56:35] Yes, that's the one! (@inflatador: I know it as wikikube, although it probably has other names. It looks like Wikifunctions run there ...)
[20:59:15] Krinkle: ah, thanks for doing so! I'll take a look. I like the additional context possible in wiki form, so it might also be worth considering referring to the page from the 8.3 tracking task (as an alternative to trying to convert it to phab markup)
[21:02:50] yeah, there'll be more context over time too. But I've distilled them into a short checklist below that which we could "instantiate" on the task.
[21:03:13] hoping to add more context as we go which can be on the wiki page only, don't want to duplicate/spread that indeed
[21:05:04] sounds good
[21:07:19] apine ACK, I'm taking an educated guess but it looks like wikifunctions runs in the `mw-wikifunctions` namespace on wikikube. Normally I'd point you to https://grafana.wikimedia.org/goto/6j-GQglNR?orgId=1 to view your namespace's memory/CPU quota but I don't see it there?
[21:09:01] you may need to check with ServiceOps (SRE team that owns wikikube). I could help with the quota stuff assuming it's not too crazy, but the request size stuff will have to go to them anyway
[21:55:51] apine: is there a task that describes the issues you folks are experiencing? that might be helpful in narrowing down how we can best help you
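[Editor's note: for the resource-limits question above, a sketch of how the configured memory limits could be inspected, assuming inflatador's educated guess of the `mw-wikifunctions` namespace on wikikube is correct and that you have kubectl access to that cluster. The request-size limit is a separate setting (likely in the service's chart or ingress configuration, which is an assumption here) and, as noted above, is a question for ServiceOps either way.]

    # The mw-wikifunctions namespace is an educated guess from the discussion above,
    # not a confirmed fact; requires kubectl access to the wikikube cluster.
    kubectl get resourcequota -n mw-wikifunctions -o yaml   # namespace-wide CPU/memory quota, if set
    kubectl get limitrange -n mw-wikifunctions -o yaml      # per-container default requests/limits, if set
    kubectl describe deployment -n mw-wikifunctions         # requests/limits applied to the running pods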