[08:05:22] good morning!
[08:06:26] yo!
[08:06:32] while checking https://grafana.wikimedia.org/d/XhFPDdMGz/cluster-overview's disk IO per host I noticed that read/writes have all positive values, but the y-axis says (writes - / reads +)
[08:07:11] so I am going to add a -1 * to the metrics config, writing in here as FYI (same config used for other panels in the same dashboard)
[08:08:43] (if I find how)
[10:36:51] hmm cloudelastic1005|1006 are alerting on their TLS cert going to expire soon (6 days) but the certificate has been renewed as expected
[10:36:56] root@cloudelastic1005:/etc/acmecerts/cloudelastic/live# openssl x509 -dates -noout -in /etc/acmecerts/cloudelastic/live/ec-prime256v1.crt
[10:36:56] notBefore=Oct 2 19:00:36 2020 GMT
[10:36:56] notAfter=Dec 31 19:00:36 2020 GMT
[10:37:33] it looks to me like those two specific cloudelastic nodes aren't reloading a service on cert changes as they should
[14:09:35] rzl: by any chance did you run a dry-run of the switchover? Asking as I didn't get any spam for a --live-test
[14:09:50] no, planning to run one later today
[14:10:07] I don't expect to learn much, we haven't really touched it since the last wet run
[14:46:53] What time is the switchover scheduled for tomorrow? (I'm sure it's on a calendar someplace but none of the ones I've checked so far)
[14:47:23] https://wikitech.wikimedia.org/wiki/Switch_Datacenter
[14:47:25] MediaWiki: Tuesday, October 27th, 2020 14:00 UTC
[14:48:42] thx
[14:48:48] ^ yep, prework starts at 13:30 UTC, read-only period should start at or shortly after 14:00
[14:59:36] https://blogs.gnome.org/hughsie/2020/10/26/new-fwupd-1-5-0-release/ mentions that Broadcom BCM5719 NIC firmware updates are going to be supported in the future, which will AFAICT be the first piece of our server gear supported by fwupd
[15:10:00] moritzm: oh nice
[15:10:42] if only fwupd's dependency tree was a bit slimmer... :)
[15:11:29] did someone accidentally run remote-backup-mariadb on cumin1001?
[15:11:47] we have some half-done backups but we don't know the source of
[15:12:26] or maybe some work related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/636410 jbond42?
[15:13:23] I guess it doesn't log who called it :)
[15:13:37] volans: that is the issue
[15:13:49] I think the disabling of logs disables both app and cumin logs
[15:13:58] so I don't have a trace
[15:14:01] jynus: yes that was me sorry i thought i had cleaned up
[15:14:10] no problem jbond42 this is great news
[15:14:25] I thought we had a weird remote execution bug, which would have been scary
[15:15:33] handling remote execution is hard, because it is not easy to respond perfectly to sigint
[15:15:50] (should it kill the remote pid? all its children too?)
[15:16:15] the other thing that informs me is I should enable logs back asap
[15:16:35] jynus: if there is anything still left from me then yes kill it please
[15:16:43] no kill
[15:16:57] the backups left temporary files behind as they didn't fully complete
[15:17:14] and a couple of servers with stopped replication
[15:17:31] will manage that now, we have at least good monitoring on that regard :-D
[15:18:41] :) ack thx
[15:19:03] also, I am happy that jbond42 was able to create 2 full backups of enwiki and wikidata so fast and easily :-D
[16:09:31] puppet on alert1001: Server Error: [413 Request Entity Too Large]
[16:09:40] >nginx/1.14.2
[16:10:44] indeed
[17:02:05] puppet on alert1001 fixed after https://gerrit.wikimedia.org/r/c/operations/puppet/+/636459
[18:14:33] does anyone know if the "mediawiki originals uploads -hourly- for eqiad" is giving a warning about too little number of originals uploads or too large?
It is not clear from the graph
[18:15:26] too many
[18:15:43] too many new uploads
[18:16:04] strange, it doesn't look like a normal level, compared to last week
[18:16:09] *an abnormal
[18:17:25] it has been as large 12 times in the last week
[18:18:56] although precisely commons wiki on eqiad starts lagging again now
[18:20:29] ah, I see it is a rolling-window graph (now - 1h) so the change is more significant
[19:38:54] mutante: is the pcc working for you today on cloud nodes? I'm seeing a lot of new complaints about profile::base::labs::unattended_wmf which I'm pretty sure isn't due to the patch I'm testing...
[19:39:39] oh sorry I keep forgetting you're not in the US
[19:41:43] andrewbogott: I am in the US
[19:41:58] I have not tried compiling today on a cloud node.. but let me try
[19:42:01] thanks
[19:44:43] yes, something looks broken...
[19:44:58] but i don't see the profile you mention in it
[19:44:59] yet
[19:45:40] interesting part is the "prod errors" show only warnings but still fai
[19:45:41] l
[19:46:43] hm, yeah, I tried on a different VM and it also failed but for different reasons
[19:46:51] I wonder if there's a general hiera lookup failure happening
[19:47:00] andrewbogott: it's like a change I wanted to make: https://gerrit.wikimedia.org/r/c/operations/puppet/+/633838/2/modules/profile/manifests/ldap/client/labs.pp
[19:47:07] except.. that is not merged...
[19:47:40] somewhat related to https://phabricator.wikimedia.org/T101447
[19:48:10] "restricted_to" and _from got removed from Hiera ?
[19:48:24] maybe merging my change fixes it :)
[19:49:44] we can try it :)
[19:50:03] profile::ldap::client::labs has not been touched recently ..
hmm
[19:50:57] and that is also the only place trying to do a lookup for profile::ldap::client::labs::restricted_to
[19:51:00] weird
[19:51:17] seems like it must have been a change in hiera
[19:51:59] also it says "function lookup" and that is still hiera() before my change..wtf
[19:52:12] or somebody cherry-picked my change?
[19:53:10] I don't see anything especially wide-sweeping in recent puppet history
[19:53:57] restricted_to does not show up in ./hieradata/
[19:54:13] can you see Hiera changes in other backends?
[19:55:36] mutante: I commented on that patch
[19:55:43] but why would it be related? It's not merged yet...
[19:55:49] is that what you're using to test the compiler?
[19:56:34] andrewbogott: thanks, and I'm glad to merge it, and yes, that's what I used to test it.. but the error does not seem to match
[19:56:57] can you see if something got cherry-picked on cloud puppetmaster?
[19:57:18] the puppet compiler doesn't use cloud-puppetmasters does it?
[19:57:40] (there aren't any cherry picks)
[19:57:45] compiling with something unrelated
[19:58:20] I don't think cloud instances can talk to prod puppetmaster
[19:58:47] andrewbogott: ok..news.. if i compile with some other random change.. compiler gives a NOOP and seems to work
[19:58:48] right but the puppet compiler is compiling locally with a fresh checkout of git
[19:58:50] as I understand it
[19:58:54] hm...
[20:00:32] so you want to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/636476/ right?
[20:00:36] looking at that one
[20:01:49] andrewbogott: it says it is missing a value for legacy_cloud_search_domain
[20:02:18] sounds like you're looking at gerrit warnings
[20:02:27] which is different from the puppet compiler thing
[20:05:11] hm… something just got better
[20:05:23] I still can't explain the errors I saw before, but...
[20:05:34] I got a failure page on the pcc pages and now it works.
Possibly related :)
[20:06:31] mutante: I don't understand your comments (or, really, the error message gerrit was giving). There is a profile::base where I do the hiera lookup
[20:06:39] https://gerrit.wikimedia.org/r/c/operations/puppet/+/636476/8/modules/profile/manifests/base.pp
[20:06:53] andrewbogott: I was able to compile your change on a random VPS and now: https://puppet-compiler.wmflabs.org/compiler1001/26137/wikistats-dancing-goat.wikistats.eqiad.wmflabs/index.html
[20:07:11] yeah it works for me now too
[20:08:58] andrewbogott: I saw the new parameter in class base::resolving ..that triggered my comments.. but they are wrong.. I am removing them
[20:09:52] comments removed
[20:10:21] 'k thanks
[20:10:36] I still don't know what was happening before but results seem reasonable now. Does the compiler like your patch now too?
[20:12:51] andrewbogott: no, it does not like it :p
[20:13:02] so it's consistent, somehow
[20:13:46] yea, so.. it tries to do the hiera lookup for restricted_to _from all the time
[20:13:55] except after my change there is no default to fall back to
[20:14:32] so either it needs a different default that does not cause warnings itself
[20:14:44] that's ok, you can just add hiera defaults (either inline or in cloud.yaml)
[20:14:47] of []
[20:15:09] ok, let me add that
[20:17:17] so these used to be arrays?
[20:17:40] checking
[20:18:19] hm, nope, I guess it's just a string
[20:18:34] used like this:
[20:18:36] https://www.irccloud.com/pastebin/1yDM4j65/
[20:19:50] ok.. setting that to an empty string seems like it would break that though
[20:19:59] how so?
[20:20:17] oh I see, hang on, let me paste some context :)
[20:20:32] https://www.irccloud.com/pastebin/rERWjYgU/
[20:20:35] it checks first
[20:21:24] I see..
then it should probably be "Boolean or String" and default_value false
[20:22:21] trying to compile it with a default of false
[20:33:08] andrewbogott: it likes this: https://puppet-compiler.wmflabs.org/compiler1002/26141/wikistats-dancing-goat.wikistats.eqiad.wmflabs/index.html
[20:33:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/633838/4/modules/profile/manifests/ldap/client/labs.pp
[20:38:51] cool
[20:40:35] should I merge it now then?
[21:44:36] I forgot there was the live test, I looked in -operations and was like "oh, we're doing it now? huh. ok then" :-D
[21:45:16] ahaha sorry
[21:45:25] I tried to put a warning before and after, but everything in between is so noisy that it's easy to miss
[23:31:15] yeah well I missed it because I was just doing a drive-by check of the channel, got what I deserved :-D