[08:15:12] 6Labs, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1298558 (10MoritzMuehlenhoff) From my reading, these crashes are all related to the networking interface between the host and the virtual machines (vhost_net on the virtualisation server and virtio...
[12:45:10] someone in here, maybe yuvi ? reported a problem with batch runs of salt which produced an "unhashable type: 'dict'" error
[12:45:29] if it was you, speak up, I have some news
[12:51:53] 6Labs, 7Tracking: Labs Project for Phragile - https://phabricator.wikimedia.org/T99672#1298845 (10Jakob_WMDE) @awjrichards Thanks!
[14:55:36] 6Labs: Fix monitor_labs_salt_keys.py to handle the new labs naming scheme - https://phabricator.wikimedia.org/T95481#1299038 (10ArielGlenn) when you say instance, you mean the ec2 name there, right? or no? give me a sample name with all the pieces in it.
[17:22:17] 6Labs, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1299421 (10yuvipanda) So I guess this would need us to test by: # Upgrading kernel on one host and rebooting (and appropriate housekeeping for instances) # Bring back hosts # Suspend and resume a...
[19:09:56] labs_lvm, but just lint https://gerrit.wikimedia.org/r/#/c/211346/
[19:10:13] openstack, but just lint https://gerrit.wikimedia.org/r/#/c/211356/
[19:10:36] i'm doing these and others for https://phabricator.wikimedia.org/T93645
[19:11:18] one more for quarry: https://gerrit.wikimedia.org/r/#/c/211354/
[19:29:45] yuvipanda: were you the one that ran into an issue with salt --batch-size a little while back?
[19:29:59] apergos: no, not me
[19:30:00] someone in here was and I don't remember who (and there was no ticket of course)
[19:30:01] hrm
[19:30:06] salt didn't work at all for me :)
[19:30:24] well that leaves two other likely suspects, I'll ask 'em later
[19:32:57] !log tools disabling puppet on *all* hosts for https://gerrit.wikimedia.org/r/#/c/210000/
[19:33:04] Logged the message, Master
[19:35:26] bd808: maybe you reported an issue with batching and salt?
[19:35:48] also I saw your comments on the changeset, I just literally have not tried it at all so it could be completely wrog.
[19:35:51] wrong
[19:35:58] however cherry picking is cheap, feel free
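(A minimal sketch of the kind of batched salt run discussed above, executed on the salt master. The target globs, batch sizes and module functions here are illustrative assumptions, not the commands that were actually run.)

    # run a function across all minions, no more than 10 at a time
    # (equivalent short form: salt -b 10 '*' test.ping)
    salt --batch-size 10 '*' test.ping

    # the batch size can also be given as a percentage of the matched minions
    salt --batch-size 25% 'tools-*' cmd.run 'uptime'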
[19:54:16] !log tools enabled puppet on tools-precise-dev
[19:54:21] Logged the message, Master
[19:54:30] !log tools copy cleaned up hosts file to /etc/hosts on tools-precise-dev
[19:54:35] Logged the message, Master
[19:56:41] !log tools copy cleaned up and regenerated /etc/hosts from tools-precise-dev to all toollabs hosts
[19:56:45] Logged the message, Master
[20:01:19] !log tools tested new /etc/hosts on tools-bastion-01, puppet run produced no diffs, all good
[20:01:27] Logged the message, Master
[20:01:28] !log tools enabling puppet on all hosts
[20:01:32] Logged the message, Master
[20:05:32] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:05:44] PROBLEM - Puppet failure on tools-webgrid-generic-1402 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:05:48] uh oh
[20:06:46] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:07:09] 6Labs: Fix labs lvm to not run script every puppet run - https://phabricator.wikimedia.org/T99823#1299852 (10yuvipanda) 3NEW
[20:07:22] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL 22.22% of data above the critical threshold [0.0]
[20:07:24] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 33.33% of data above the critical threshold [0.0]
[20:07:34] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 30.00% of data above the critical threshold [0.0]
[20:07:35] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 22.22% of data above the critical threshold [0.0]
[20:07:48] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:07:59] hmm
[20:08:03] not sure where these are from
[20:08:35] PROBLEM - Puppet failure on tools-precise-dev is CRITICAL 50.00% of data above the critical threshold [0.0]
[20:08:49] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 40.00% of data above the critical threshold [0.0]
[20:08:53] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 40.00% of data above the critical threshold [0.0]
[20:08:55] ah
[20:09:09] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 22.22% of data above the critical threshold [0.0]
[20:09:10] !log tools transient shinken puppet alerts because I tried to force puppet runs on all tools hosts but cancelled
[20:09:15] Logged the message, Master
[20:09:18] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL 33.33% of data above the critical threshold [0.0]
[20:10:02] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 40.00% of data above the critical threshold [0.0]
[20:10:18] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 33.33% of data above the critical threshold [0.0]
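(A rough sketch, run from the salt master, of the disable / test / re-enable sequence the !log entries above describe. The target globs, file paths and copy mechanism are assumptions for illustration only.)

    # disable puppet fleet-wide before touching /etc/hosts handling
    salt 'tools-*' cmd.run 'puppet agent --disable "hosts cleanup, gerrit 210000"'

    # push the cleaned-up hosts file to one host and check for diffs there first
    salt-cp 'tools-precise-dev' /tmp/hosts.cleaned /etc/hosts
    salt 'tools-precise-dev' cmd.run 'puppet agent --enable && puppet agent --test'

    # once a test run shows no diffs, re-enable puppet everywhere
    salt 'tools-*' cmd.run 'puppet agent --enable'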
[20:13:13] 6Labs, 10Labs-Infrastructure, 3ToolLabs-Goals-Q4: Move LabsDB aliases to DNS - https://phabricator.wikimedia.org/T63897#1299896 (10scfc)
[20:14:54] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:17:27] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0]
[20:19:10] 6Labs, 10Labs-Infrastructure, 3ToolLabs-Goals-Q4: Move LabsDB aliases to DNS - https://phabricator.wikimedia.org/T63897#1299908 (10yuvipanda) We still need to move these to DNS. Need to set up a bunch of blocker tasks for that (moving to designate, split horizon, etc)
[20:19:11] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0]
[20:20:17] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0]
[20:20:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0]
[20:21:43] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0]
[20:23:49] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0]
[20:23:51] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0]
[20:24:13] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK Less than 1.00% above the threshold [0.0]
[20:27:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK Less than 1.00% above the threshold [0.0]
[20:29:58] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0]
[20:30:46] RECOVERY - Puppet failure on tools-webgrid-generic-1402 is OK Less than 1.00% above the threshold [0.0]
[20:33:36] RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0]
[20:37:26] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0]
[20:37:32] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0]
[20:37:50] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0]
[20:58:30] !log deployment-prep updated OCG to version ca4f64852de5b1de782b292b50038fbd2dd84266
[20:58:35] Logged the message, Master
[21:03:44] (03CR) 10Lucie Kaffee: "I'm reviewing it now. Would be nice to have some kind of documentation additionally." [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/202610 (owner: 10Ricordisamoa)
[21:24:45] (03CR) 10Lucie Kaffee: [C: 031] "I'd merge it like this, looks good to me." [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/202610 (owner: 10Ricordisamoa)
[21:29:31] (03CR) 10Ricordisamoa: "It should indeed be more documented. And maybe less hackish here and there." [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/202610 (owner: 10Ricordisamoa)
[21:29:56] (03CR) 10Ricordisamoa: [C: 032 V: 032] Initial commit [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/202610 (owner: 10Ricordisamoa)
[21:55:49] I read the mail that's currently linked to in the channel topic about the fingerprint changing.
[21:56:01] Does that apply to a Labs instance too?
[21:56:13] ssh just complained about a fingerprint change
[21:56:33] and the current ECDSA key fingerprint is different from the one in that email
[21:58:01] polybuildr: which instance?
[21:59:40] JohnLewis: spam-honeypot.eqiad.wmflabs
[21:59:51] I'm guessing that's the identifier you're looking for?
[22:01:11] polybuildr: yeah. the fingerprint of the instance should not have changed unless it was recently rebuilt
[22:01:35] rebuilt? not by me, at least.
[22:04:02] JohnLewis: Anything suspicious about that?
[22:04:39] polybuildr: the instance you listed doesn't exist apparently. Did you mean honeypot-wiki-alpha.eqiad.wmflabs?
[22:05:19] according to the instance page doesn't seem like anything changed that would change the fingerprint so, I'm unsure.
[22:05:39] JohnLewis: Ouch.
[22:06:07] Hold on a minute, I'm making a mistake.
[22:07:33] JohnLewis: Right, that was an old instance. :P I made a new one with the newer name and was trying to ssh to the old one for some reason.
[22:07:54] that's why then :)
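(For anyone hitting the same fingerprint warning: a quick way to check which key a host is actually presenting and to clear a stale known_hosts entry. The hostname below is just the one mentioned in the conversation; substitute the instance you are connecting to.)

    # fetch the host's current ECDSA key and print its fingerprint
    ssh-keyscan -t ecdsa spam-honeypot.eqiad.wmflabs > /tmp/instance.key
    ssh-keygen -lf /tmp/instance.key

    # if the cached entry really is stale, drop it from known_hosts
    ssh-keygen -R spam-honeypot.eqiad.wmflabs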