[14:31:14] jbond42: you around?
[14:32:00] yes
[14:32:18] I think something is not right on puppetdb, see https://grafana.wikimedia.org/d/000000477/puppetdb?panelId=19&fullscreen&orgId=1&from=now-90d&to=now
[14:32:37] this might explain some issues with the reimages we got this past week and also today (cc ema, vgutierrez )
[14:33:30] I'm also seeing a lot of clojure.lang.ExceptionInfo: Value does not match schema: {:job_id disallowed-key} errors in the logs, not sure if related to your "magic change" ;)
[14:33:56] ok let me look into it and see what i can find
[14:34:41] basically my current theory is that the downtime failed with no host found because the host was not yet added to puppetdb because of the large queue
[14:35:10] and might explain some odd behaviour too
[14:35:14] no idea if related, but I did just separately notice we have a lot of very questionable facts -- ones with prefixes ipaddress6_cali, macaddress_cali, netmask6_cali, which all seem to have a full MAC address appended to them
[14:35:53] e.g. https://puppetboard.wikimedia.org/fact/macaddress_cali015ca68d125
[14:36:13] calico?
[14:36:26] ah, job_id is handled by the Lua hack for puppet 5 clients reporting to puppetdb 4, this probably needs some tweaking for the current dual broadcasting
[14:37:48] that definitely will need fixing, however i think puppetdb will only be going to puppetdb[12]001 so it shouldn't be affected by the dual db change. also the queue has been growing since the end of sept, long before the buster dbs were being populated
[14:38:34] cdanis: clearly something k8s specific (cc akosiaris ), just run ifconfig on one of the k8s hosts ;)
[14:38:41] those are the interface names
[14:39:02] ~30 per host
[14:39:09] guessing those fact names are from before the 'structured fact' times?
[14:41:18] jbond42: my worry is that those errors kill a thread and maybe contribute to making the queue grow
[14:42:31] volans: what do you mean by "those errors" ?
[14:43:13] the job_id ones, I'm not sure if they kill a thread or maybe get rescheduled or what
[14:45:05] ahh it looks like not all the job_id values are being stripped out
[14:47:45] * volans brb in 10, need to do one thing before a meeting
[15:07:53] jbond42: in a meeting but let me know if I can help in any way
[15:09:50] thanks, just looking, it seems cloudelastic1004.wikimedia.org fails to submit its report each time so i have a test case
[16:39:56] volans: for now i have cleared the queue down so things should get processed better. the issue lies somewhere in the job_id hack, however i think we can probably switch to the new dbs so i'll work on that tomorrow
[16:41:54] jbond42: ack, thanks for looking into it, if you need a task you can use T236684
[16:41:55] T236684: sre.hosts.downtime fails with "No hosts provided" - https://phabricator.wikimedia.org/T236684
[16:42:23] ack thanks
[17:18:10] paladox or mutante, I created a gerrit user from ldap but there was a typo in the ldap username. I've since fixed the issue in ldap; is it possible to just delete the gerrit user so I can recreate it?
[17:19:38] Username: instance-puppet-used, full name: Instance-puppet-user
[17:19:43] It's that 'used' that I need to change to 'user'
[17:27:40] andrewbogott i think it's possible to rename. it'll involve doing it in All-Users
[17:28:02] paladox: I don't think I know what that means
[17:28:05] you mean a db change?
[17:28:36] All-Users is the db :)
[17:28:48] You'll need thcipriani for this
[17:28:53] ok
[17:28:57] * andrewbogott moves to -releng
[17:33:21] we have power outages here again. some areas have power and some don't. can't find a place to sit at the cafe because all the people without power are hanging out there already
[17:34:29] mutante: California snow day!
[17:34:53] andrewbogott: haha, yes. "power outages and fires"
[17:40:57] jbond42: how did you clear the queue?
[17:41:42] oh no, why, what problem did it cause?
[17:42:00] * volans parse fail
[17:42:22] have i made things worse?
[17:42:36] not that I know of :)
[17:42:45] anyway i removed the reports in /var/lib/puppetdb/stockpile/cmd/q and restarted
[17:42:59] maybe it should be documented
[17:43:00] it's kind of funny how people are freaking out over their cell phones not working (because apparently some towers also don't have power). We have all become so dependent.
[17:43:11] it would have meant some reports got lost, but it mostly contains reports that failed to submit
[17:43:23] ack
[17:45:54] and yes i can document it, i'll add it to the wiki and associate it with whatever check comes from T236707
[17:45:55] T236707: Monitoring for puppetdb queue size - https://phabricator.wikimedia.org/T236707
[17:50:19] great, thanks for the follow up
[17:51:00] looks like there are ~5 nodes which constantly fail to submit their report, which is causing the slow growth in the queue. not sure why: when i look at the debug logging, lua sees no body, but then i see an error about an invalid job_id in the logs. puppetdb[21]002 seems to be processing stuff well so i'll try to shift a few nodes there and do some tests. gut feeling is it will be simpler to just swap to
[17:51:06] the new masters than debugging this
[17:51:56] s/new masters/puppetdbs/
[18:34:32] nice finding
[18:34:34] which nodes?
[18:34:42] which OS/puppet version?
[19:48:32] look at the files in /var/lib/puppetdb/stockpile/cmd/q for the nodes that repeat, i haven't looked for a pattern yet but the node i looked at today (cloudelastic1004.wikimedia.org) didn't seem to have anything strange about it
[19:52:27] ls -1 /var/lib/puppetdb/stockpile/cmd/q | awk -F_ '{print $NF}' | sort | uniq -c
[19:54:56] cloudelastic100[1-4], mwmaint2001, netmon1002 are the repeaters
[20:01:34] cloudelastic100[1-4], mwmaint2001, netmon1002 are the repeaters
[20:38:30] ack
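
The queue inspection and clearing steps discussed above, pulled together as a rough shell sketch. The stockpile path and the awk one-liner come from the log itself; the systemd unit name "puppetdb" is an assumption about how the restart was done, not something stated in the log.

  # Count queued commands per certname to spot nodes that keep failing to
  # submit their report (the last underscore-separated field of each queued
  # file name is the certname).
  ls -1 /var/lib/puppetdb/stockpile/cmd/q | awk -F_ '{print $NF}' | sort | uniq -c | sort -rn

  # Clear the backlog as described above: remove the queued commands (mostly
  # reports that failed to submit, so some reports are lost) and restart the
  # service. The unit name "puppetdb" is an assumption.
  rm -f /var/lib/puppetdb/stockpile/cmd/q/*
  systemctl restart puppetdb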
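
A minimal sketch of how the "host was not yet added to puppetdb" theory from earlier in the log could be checked, assuming the standard PuppetDB v4 query API on localhost:8080; the host, port and certname are illustrative assumptions, not the team's actual tooling.

  # Sketch only: localhost:8080 and the certname are assumptions; the endpoint
  # is the standard PuppetDB v4 query API.
  curl -s http://localhost:8080/pdb/query/v4/nodes/cloudelastic1004.wikimedia.org
  # If the host has not been ingested yet (e.g. because its commands are still
  # stuck in the queue), PuppetDB answers with an error along the lines of
  # "No information is known about node <certname>", which is consistent with
  # the downtime cookbook failing with "No hosts provided".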