[06:08:18] hello folks
[06:08:33] icinga seems to be a christmas tree, even if nothing super horrible is happening :D
[06:20:28] I cleared 1
[06:20:34] * legoktm acked 1
[06:20:54] what's up with this /var/tmp/ storage alert on the ping hosts? afaik /var/tmp/ is empty
[06:21:51] XioNoX: o/ I think it is the root partition, I am running apt-get clean to free space
[06:22:00] ah, thx :)
[06:22:08] the root is tiny, and I think old kernels should be wiped
[06:22:20] moritzm: --^ ok to do it?
[06:26:50] elukey: the best part is to sort the alerts by "duration"
[06:31:33] * elukey cries in a corner
[06:32:26] <_joe_> there is no set policy on who owns alerts and how long they can stay unacked, nor even a cultural norm shared between all SREs
[06:32:43] <_joe_> (I prefer norms over policies because it's easier to get buy-in)
[06:33:02] hnowlan: o/ should we ack the alert for maps1005? Not sure what its state is :(
[06:38:04] <_joe_> also: these are 19 unacked criticals, I've seen *much* worse
[06:44:53] acked one :-D
[06:45:02] thanks :)
[06:45:34] godog: o/ some curator units seem to be down for logstash1007, tried to restart / check logs but didn't find much, can you help when you have a moment?
[06:49:13] acked two warnings for good measure :-D
[06:59:35] elukey: ack!
[07:00:36] they have unusually small root partitions, we'll make new instances with the regular 10G "disks" when they get moved to bullseye
[07:35:40] elukey: yeah it is known, I'll downtime it
[07:36:16] <3
[08:13:51] new alertmanager features: https://twitter.com/PrometheusIO/status/1389675502869299207
[08:16:56] neat
[08:58:42] elukey: missed that one, thanks!
[09:29:52] [cross-posting] FYI I'll be upgrading spicerack on cumin2001 in a few minutes, for critical cookbook runs please use cumin1001 for the next hour or so.
[10:59:01] all looks fine so far for the existing stuff, upgrading cumin1001 too, feel free to use both
[11:09:15] FYI all there is a new #cfssl-pki phab tag for any cfssl pki task
[11:09:39] ack!
[11:44:48] anyone know how often https://config-master.wikimedia.org/known_hosts.ecdsa is regenerated?
[11:44:59] reimaging db1173 just finished, but it doesn't appear in that file
[11:45:56] jbond42: cfssl is failing puppet on deployment-prep hosts, failing to load file puppet:///modules/profile/pki/intermediates/deployment_prep_eqiad1_wikimedia_cloud.pem, the file in puppet is deployment-prep not deployment_prep
[11:45:56] https://github.com/wikimedia/puppet/blob/production/modules/profile/files/pki/intermediates/deployment-prep_eqiad1_wikimedia_cloud.pem
[11:46:09] is that as simple as renaming or something else?
[11:46:35] Majavah: one sec let me take a look
[11:46:53] which host
[11:47:15] deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud
[11:48:35] turns out that config-master points to puppetmaster1001, and running puppet there regen'd the file.
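A quick way to double-check the known_hosts question above, once the file's source host is known, is to force a puppet run there and grep the published file for the reimaged host. This is only a sketch using the stock puppet agent invocation; production may use a wrapper command instead.

    # on puppetmaster1001, which config-master serves the file from (per the log above)
    sudo puppet agent --test

    # then verify the freshly reimaged host shows up in the published file
    curl -s https://config-master.wikimedia.org/known_hosts.ecdsa | grep db1173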
[11:51:53] Majavah: about to grab lunch, however the following should fix the issue https://gerrit.wikimedia.org/r/c/operations/puppet/+/685758
[11:51:59] will take a look after lunch if not
[11:52:01] ack, thanks
[11:52:18] fyi that one is merged on prod so just needs syncing
[12:22:42] Majavah: seems fixed
[15:42:53] marostegui: it's fine to do now
[15:43:19] I remember spending a lot of time figuring out last year what the retention is supposed to be for these jobs' journal logs
[15:43:27] and regretfully did not write it down
[15:43:40] it seems this one's logs only go back less than 24 hours, which is useless
[15:43:49] it won't take any earlier date
[15:44:36] Krinkle: did your script finish already?
[15:45:08] marostegui: I haven't begun, I'm looking at what's running now
[15:45:25] the one that started >24h ago is still running
[15:45:34] Krinkle: sure, I don't want to end up with two scripts at the same time
[15:45:46] yours and the other one :)
[15:45:46] OK, after following about 20 layers of indirection I found https://github.com/wikimedia/puppet/blob/production/modules/systemd/templates/logrotate.erb
[15:45:49] which suggests daily rotation
[15:45:57] and journalctl can't find the older ones?
[15:46:25] * Krinkle looks on disk
[15:46:28] there *are* no older ones
[15:46:39] so it's rotate with 0 retention
[15:47:00] doesn't rotate 15 mean it should keep 15 days?
[15:47:25] Krinkle: the old one is still deleting stuff from pc224 (if it goes in order, it still has 30 tables to go)
[15:48:36] maybe size overrules rotate.
[15:48:43] today's file is 58M in size already
[15:49:02] so maybe 'size 256M' kicked in and wiped it all yesterday if it outgrows that
[15:49:22] I admittedly don't know logrotate very well besides the basics
[15:50:12] ok, someone put an \r in the print lines from purgeParserCache.php which journalctl won't display without --all.
[15:53:03] this script invocation started May 2nd
[15:53:53] and if I'm reading ps output right, it spent 18 hours in bash before starting the actual php script
[15:54:10] May02 0:00 /bin/bash /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 500
[15:54:10] May02 18:08 php /srv/mediawiki-staging/multiversion/MWScript.php purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 500
[15:54:21] could that be true? I'm feeling very stupid right now.
[15:57:58] also, any clue on the logrotate stuff welcome :)
[16:00:21] I'm also a bit unsure as to why `interval => '01:00'` means it starts at 00:00.
[16:00:57] maybe 01:00 is magic for @daily and it just decided nothing else was happening and started right away
[16:04:16] marostegui: I propose not to run it manually and we just merge the change and wait for the next run
[16:05:08] I guess maybe the reason it started at 00:00 is that the previous run naturally also took several days, so it's not that 01:00 mapped to 00:00 but that it's several days of waiting until the next day rollover and then it starts, or something.
[16:05:41] but yeah, I'm guessing either we've grown a lot, or we just never checked how long it took after we added the sleep for T150124
[16:05:42] T150124: Parsercache purging can create lag - https://phabricator.wikimedia.org/T150124
[16:05:50] maybe it's been running over multiple days ever since then
[16:06:17] Krinkle: sure I can do that later, I need to go afk for now
[16:08:45] volans: can I merge your patch? 'python_deploy::venv: fix typo in path'
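For the logrotate/journalctl thread above, a few read-only checks can usually settle what the rendered config actually does; the unit and config file names below are placeholders, not the real ones on the maintenance host.

    # show the purge script's output, including the lines hidden by the stray \r
    sudo journalctl -u <purge-timer-unit> --all

    # dry-run logrotate against the rendered config: prints what it would rotate, changes nothing
    sudo logrotate -d /etc/logrotate.d/<rendered-config>

    # when logrotate last rotated each file (Debian keeps its state here)
    cat /var/lib/logrotate/status

Also worth keeping in mind while reading the config: 'rotate 15' is a count of rotated files to keep, not a number of days, and a plain 'size' directive (unlike 'maxsize') ties rotation to file size rather than the daily schedule.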
[16:08:52] andrewbogott: go ahead
[16:08:53] thx
[16:08:54] thx
[16:09:03] I just found your lock :)
[16:09:05] trying to merge it
[16:10:13] we currently wait 0.5ms after each select query (100 rows from 1 parser cache table), which is then followed by 0 or 1 delete statements for 0...100 rows.
[16:11:02] that feels fairly small and is hardcoded, doesn't use $wgUpdateRowsPerQuery
[16:11:26] but UpdateRowsPerQuery is 100 by default and we don't raise that in prod either
[16:25:37] ok, it's not 0.5ms (500 usleep), it's getting multiplied somewhere along the way, so it is as bad as I thought: we're waiting 0.5s / 500ms for every 100 rows. That seems a little high. But I do note Aaron and jynus intentionally increased it from 100ms to 500ms in 2016, so maybe we need to.
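A back-of-the-envelope check of that 500 ms figure, to show why the run stretches over days; it ignores the time the SELECT/DELETE statements themselves take, so the real rate is lower still.

    # upper bound on purge throughput if the script sleeps 500 ms per 100-row batch
    rows_per_batch=100
    sleep_ms=500
    echo "$(( rows_per_batch * 1000 / sleep_ms )) rows/s ceiling"            # 200
    echo "$(( rows_per_batch * 1000 / sleep_ms * 86400 )) rows/day ceiling"  # 17280000

At roughly 17M rows/day, multi-day runs follow as soon as the tables being purged hold tens of millions of rows or more.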