[14:31:14] jbond42: you around?
[14:32:00] yes
[14:32:18] I think something is not right on puppetdb, see https://grafana.wikimedia.org/d/000000477/puppetdb?panelId=19&fullscreen&orgId=1&from=now-90d&to=now
[14:32:37] this might explain some issues with the reimages we got this past week and also today (cc ema, vgutierrez )
[14:33:30] I'm also seeing a lot of clojure.lang.ExceptionInfo: Value does not match schema: {:job_id disallowed-key} errors in the logs, not sure if related to your "magic change" ;)
[14:33:56] ok let me look into it and see what i can find
[14:34:41] basically my current theory is that the downtime failed with no host found because the host was not yet added to puppetdb because of the large queue
[14:35:10] and might explain some odd behaviour too
[14:35:14] no idea if related, but I did just separately notice we have a lot of very questionable facts -- ones with prefixes ipaddress6_cali, macaddress_cali, netmask6_cali, which all seem to have a full MAC address appended to them
[14:35:53] e.g. https://puppetboard.wikimedia.org/fact/macaddress_cali015ca68d125
[14:36:13] calico?
[14:36:26] ah, job_id is handled by the Lua hack for puppet 5 clients reporting to puppetdb 4, this probably needs some tweaking for the current dual broadcasting
[14:37:48] that definitely will need fixing, however i think puppetdb will only be going to puppetdb[12]001 so it shouldn't be affected by the dual db change. also the queue has been growing since the end of sept, long before the buster dbs were being populated
[14:38:34] cdanis: clearly something k8s specific (cc akosiaris ), just run ifconfig on one of the k8s hosts ;)
[14:38:41] those are the interface names
[14:39:02] ~30 per host
[14:39:09] guessing those fact names are from before the 'structured fact' times?
[14:41:18] jbond42: my worry is that those errors kill a thread and maybe contribute to making the queue grow
[14:42:31] volans: what do you mean by "those errors" ?
[14:43:13] the job_id ones, I'm not sure if they kill a thread or maybe get rescheduled or what
[14:45:05] ahh it looks like not all the job_id values are being stripped out
[14:47:45] * volans brb in 10, need to do one thing before a meeting
[15:07:53] jbond42: in a meeting but let me know if I can help in any way
[15:09:50] thanks, just looking, it seems cloudelastic1004.wikimedia.org fails to submit its report each time so i have a test case
[16:39:56] volans: for now i have cleared the queue down so things should get processed better. the issue lies somewhere in the job_id hack, however i think we can probably switch to the new dbs so i'll work on that tomorrow
[16:41:54] jbond42: ack, thanks for looking into it, if you need a task you can use T236684
[16:41:55] T236684: sre.hosts.downtime fails with "No hosts provided" - https://phabricator.wikimedia.org/T236684
[16:42:23] ack thanks
[17:18:10] paladox or mutante, I created a gerrit user from ldap but there was a typo in the ldap username. I've since fixed the issue in ldap; is it possible to just delete the gerrit user so I can recreate it?
[17:19:38] Username: instance-puppet-used, full name: Instance-puppet-user
[17:19:43] It's that 'used' that I need to change to 'user'
[17:27:40] andrewbogott i think it's possible to rename. it'll involve doing it in All-Users
[17:28:02] paladox: I don't think I know what that means
[17:28:05] you mean a db change?
[17:28:36] All-Users is the db :)
[17:28:48] You'll need thcipriani for this
[17:28:53] ok
[17:28:57] * andrewbogott moves to -releng
[17:33:21] we have power outages here again. some areas have power and some don't. can't find a place to sit at the cafe because all the people without power are hanging out there already
[17:34:29] mutante: California snow day!
[17:34:53] andrewbogott: haha, yes. "power outages and fires"
[17:40:57] jbond42: how did you clear the queue?
[17:41:42] oh no, why, what problem did it cause?
[17:42:00] * volans parse fail
[17:42:22] have i made things worse?
[17:42:36] not that I know of :)
[17:42:45] anyway i removed the reports in /var/lib/puppetdb/stockpile/cmd/q and restarted
[17:42:59] maybe it should be documented
[17:43:00] it's kind of funny how people are freaking out over their cell phones not working (because apparently some towers also don't have power). We have all become so dependent.
[17:43:11] it would have meant some reports got lost, but it mostly contains reports that failed to submit
[17:43:23] ack
[17:45:54] and yes i can document it, i'll add it to the wiki and associate it with whatever check comes from T236707
[17:45:55] T236707: Monitoring for puppetdb queue size - https://phabricator.wikimedia.org/T236707
[17:50:19] great, thanks for the follow up
[17:51:00] looks like there are ~5 nodes which constantly fail to submit their report, which is causing the slow growth in the queue. not sure why: when i look at the debug logging, lua sees no body, but then i see an error about an invalid job_id in the logs. puppetdb[21]002 seems to be processing stuff well so i'll try to shift a few nodes there and do some tests. gut feeling is it will be simpler to just swap to
[17:51:06] the new masters than debugging this
[17:51:56] s/new masters/puppetdbs/
[18:34:32] nice finding
[18:34:34] which nodes?
[18:34:42] which OS/puppet version?
[19:48:32] look at the files in /var/lib/puppetdb/stockpile/cmd/q for the nodes that repeat, i haven't looked for a pattern yet but the node i looked at today (cloudelastic1004.wikimedia.org) didn't seem to have anything strange about it
[19:52:27] ls -1 /var/lib/puppetdb/stockpile/cmd/q | awk -F_ '{print $NF}' | sort | uniq -c
[19:54:56] cloudelastic100[1-4], mwmaint2001, netmon1002 are the repeaters
[20:01:34] cloudelastic100[1-4], mwmaint2001, netmon1002 are the repeaters
[20:38:30] ack
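
The queue inspection and clearing steps discussed above, pulled together as a rough shell sketch. The stockpile path and the awk one-liner come from the log itself; the systemd unit name "puppetdb" is an assumption about how the restart was done, not something stated in the log.

  # Count queued commands per certname to spot nodes that keep failing to
  # submit their report (the last underscore-separated field of each queued
  # file name is the certname).
  ls -1 /var/lib/puppetdb/stockpile/cmd/q | awk -F_ '{print $NF}' | sort | uniq -c | sort -rn

  # Clear the backlog as described above: remove the queued commands (mostly
  # reports that failed to submit, so some reports are lost) and restart the
  # service. The unit name "puppetdb" is an assumption.
  rm -f /var/lib/puppetdb/stockpile/cmd/q/*
  systemctl restart puppetdb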
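
A minimal sketch of how the "host was not yet added to puppetdb" theory from earlier in the log could be checked, assuming the standard PuppetDB v4 query API on localhost:8080; the host, port and certname are illustrative assumptions, not the team's actual tooling.

  # Sketch only: localhost:8080 and the certname are assumptions; the endpoint
  # is the standard PuppetDB v4 query API.
  curl -s http://localhost:8080/pdb/query/v4/nodes/cloudelastic1004.wikimedia.org
  # If the host has not been ingested yet (e.g. because its commands are still
  # stuck in the queue), PuppetDB answers with an error along the lines of
  # "No information is known about node <certname>", which is consistent with
  # the downtime cookbook failing with "No hosts provided".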