[06:08:18] hello folks
[06:08:33] icinga seems to be a christmas tree, even if nothing super horrible is happening :D
[06:20:28] I cleared 1
[06:20:34] * legoktm acked 1
[06:20:54] what's up with this /var/tmp/ storage alert on the ping hosts? afaik /var/tmp/ is empty
[06:21:51] XioNoX: o/ I think it is the root partition, I am running apt-get clean to free space
[06:22:00] ah, thx :)
[06:22:08] the root is tiny, and I think old kernels should be wiped
[06:22:20] moritzm: --^ ok to do it?
[06:26:50] elukey: the best part is to sort the alerts by "duration"
[06:31:33] * elukey cries in a corner
[06:32:26] <_joe_> there is no set policy on who owns alerts and how long they can stay unacked, nor even a cultural norm shared between all SREs
[06:32:43] <_joe_> (I prefer norms over policies because it's easier to get buy-in)
[06:33:02] hnowlan: o/ should we ack the alert for maps1005? Not sure what its state is :(
[06:38:04] <_joe_> also: these are 19 unacked criticals, I've seen *much* worse
[06:44:53] acked one :-D
[06:45:02] thanks :)
[06:45:34] godog: o/ some curator units seem to be down for logstash1007, tried to restart / check logs but didn't find much, can you help when you have a moment?
[06:49:13] acked two warnings for good measure :-D
[06:59:35] elukey: ack!
[07:00:36] they have unusually small root partitions, we'll make new instances with the regular 10G "disks" when they get moved to bullseye
[07:35:40] elukey: yeah it is known, I'll downtime it
[07:36:16] <3
[08:13:51] new alertmanager features: https://twitter.com/PrometheusIO/status/1389675502869299207
[08:16:56] neat
[08:58:42] elukey: missed that one, thanks!
[09:29:52] [cross-posting] FYI I'll be upgrading spicerack on cumin2001 in a few minutes, for critical cookbook runs please use cumin1001 for the next hour or so.
[10:59:01] all looks fine so far for the existing stuff, upgrading cumin1001 too, feel free to use both
[11:09:15] FYI all there is a new #cfssl-pki phab tag for any cfssl pki task
[11:09:39] ack!
[11:44:48] anyone know how often https://config-master.wikimedia.org/known_hosts.ecdsa is regenerated?
[11:44:59] reimaging db1173 just finished, but it doesn't appear in that file
[11:45:56] jbond42: cfssl is failing puppet on deployment-prep hosts, failing to load file puppet:///modules/profile/pki/intermediates/deployment_prep_eqiad1_wikimedia_cloud.pem, the file in puppet is deployment-prep not deployment_prep
[11:45:56] https://github.com/wikimedia/puppet/blob/production/modules/profile/files/pki/intermediates/deployment-prep_eqiad1_wikimedia_cloud.pem
[11:46:09] is that as simple as renaming or something else?
[11:46:35] Majavah: one sec let me take a look
[11:46:53] which host
[11:47:15] deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud
[11:48:35] turns out that config-master points to puppetmaster1001, and running puppet there regen'd the file.
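A quick way to double-check the known_hosts question above, once the file's source host is known, is to force a puppet run there and grep the published file for the reimaged host. This is only a sketch using the stock puppet agent invocation; production may use a wrapper command instead.

    # on puppetmaster1001, which config-master serves the file from (per the log above)
    sudo puppet agent --test

    # then verify the freshly reimaged host shows up in the published file
    curl -s https://config-master.wikimedia.org/known_hosts.ecdsa | grep db1173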
[11:51:53] Majavah: about to grab lunch, however the following should fix the issue https://gerrit.wikimedia.org/r/c/operations/puppet/+/685758
[11:51:59] will take a look after lunch if not
[11:52:01] ack, thanks
[11:52:18] fyi that one is merged on prod so just needs syncing
[12:22:42] Majavah: seems fixed
[15:42:53] marostegui: it's fine to do now
[15:43:19] I remember spending a lot of time figuring out last year what the retention is supposed to be for these jobs' journal logs
[15:43:27] and regretfully did not write it down
[15:43:40] it seems this one's logs only go back less than 24 hours, which is useless
[15:43:49] it won't take any earlier date
[15:44:36] Krinkle: did your script finish already?
[15:45:08] marostegui: I haven't begun, I'm looking at what's running now
[15:45:25] the one that started >24h ago is still running
[15:45:34] Krinkle: sure, I don't want to end up with two scripts at the same time
[15:45:46] yours and the other one :)
[15:45:46] OK, after following about 20 layers of indirection I found https://github.com/wikimedia/puppet/blob/production/modules/systemd/templates/logrotate.erb
[15:45:49] which suggests daily rotation
[15:45:57] and journalctl can't find the older ones?
[15:46:25] * Krinkle looks on disk
[15:46:28] there *are* no older ones
[15:46:39] so it's rotate with 0 retention
[15:47:00] doesn't rotate 15 mean it should keep 15 days?
[15:47:25] Krinkle: the old one is still deleting stuff from pc224 (if it goes in order, it still has 30 tables to go)
[15:48:36] maybe size overrules rotate.
[15:48:43] today's file is 58M in size already
[15:49:02] so maybe 'size 256M' kicked in and wiped it all yesterday if it outgrows that
[15:49:22] I admittedly don't know logrotate very well besides the basics
[15:50:12] ok, someone put an \r in the print lines from purgeParserCache.php which journalctl won't display without --all.
[15:53:03] this script invocation started May 2nd
[15:53:53] and if I'm reading ps output right, it spent 18 hours in bash before starting the actual php script
[15:54:10] May02 0:00 /bin/bash /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 500
[15:54:10] May02 18:08 php /srv/mediawiki-staging/multiversion/MWScript.php purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 500
[15:54:21] could that be true? I'm feeling very stupid right now.
[15:57:58] also, any clue on the logrotate stuff welcome :)
[16:00:21] I'm also a bit unsure as to why `interval => '01:00'` means it starts at 00:00.
[16:00:57] maybe 01:00 is magic for @daily and it just decided nothing else was happening and started right away
[16:04:16] marostegui: I propose not to run it manually and we just merge the change and wait for the next run
[16:05:08] I guess maybe the reason it started at 00:00 is that the previous run naturally also took several days, so it's not that 01:00 mapped to 00:00 but that it's several days of waiting until the next day rollover and then it starts, or something.
[16:05:41] but yeah, I'm guessing either we've grown a lot, or we just never checked how long it took after we added the sleep for T150124
[16:05:42] T150124: Parsercache purging can create lag - https://phabricator.wikimedia.org/T150124
[16:05:50] maybe it's been running over multiple days ever since then
[16:06:17] Krinkle: sure I can do that later, I need to go afk for now
[16:08:45] volans: can I merge your patch? 'python_deploy::venv: fix typo in path'
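For the logrotate/journalctl thread above, a few read-only checks can usually settle what the rendered config actually does; the unit and config file names below are placeholders, not the real ones on the maintenance host.

    # show the purge script's output, including the lines hidden by the stray \r
    sudo journalctl -u <purge-timer-unit> --all

    # dry-run logrotate against the rendered config: prints what it would rotate, changes nothing
    sudo logrotate -d /etc/logrotate.d/<rendered-config>

    # when logrotate last rotated each file (Debian keeps its state here)
    cat /var/lib/logrotate/status

Also worth keeping in mind while reading the config: 'rotate 15' is a count of rotated files to keep, not a number of days, and a plain 'size' directive (unlike 'maxsize') ties rotation to file size rather than the daily schedule.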
[16:08:52] andrewbogott: go ahead
[16:08:53] thx
[16:08:54] thx
[16:09:03] I just found your lock :)
[16:09:05] trying to merge it
[16:10:13] we currently wait 0.5ms after each select query (100 rows from 1 parser cache table), which is then followed by 0 or 1 delete statements for 0...100 rows.
[16:11:02] that feels fairly small and is hardcoded, doesn't use $wgUpdateRowsPerQuery
[16:11:26] but UpdateRowsPerQuery is 100 by default and we don't raise that in prod either
[16:25:37] ok, it's not 0.5ms (500 usleep), it's getting multiplied somewhere along the way, so it is as bad as I thought: we're waiting 0.5s / 500ms for every 100 rows. That seems a little high. But I do note Aaron and jynus intentionally increased it from 100ms to 500ms in 2016, so maybe we need to.
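A back-of-the-envelope check of that 500 ms figure, to show why the run stretches over days; it ignores the time the SELECT/DELETE statements themselves take, so the real rate is lower still.

    # upper bound on purge throughput if the script sleeps 500 ms per 100-row batch
    rows_per_batch=100
    sleep_ms=500
    echo "$(( rows_per_batch * 1000 / sleep_ms )) rows/s ceiling"            # 200
    echo "$(( rows_per_batch * 1000 / sleep_ms * 86400 )) rows/day ceiling"  # 17280000

At roughly 17M rows/day, multi-day runs follow as soon as the tables being purged hold tens of millions of rows or more.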