[08:15:12] 6Labs, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1298558 (10MoritzMuehlenhoff) From my reading, these crashes are all related to the networking interface between the host and the virtual machines (vhost_net on the virtualisation server and virtio...
[12:45:10] someone in here, maybe yuvi ? reported a problem with batch runs of salt which produced an "unhashable type: 'dict'" error
[12:45:29] if it was you, speak up, I have some news
[12:51:53] 6Labs, 7Tracking: Labs Project for Phragile - https://phabricator.wikimedia.org/T99672#1298845 (10Jakob_WMDE) @awjrichards Thanks!
[14:55:36] 6Labs: Fix monitor_labs_salt_keys.py to handle the new labs naming scheme - https://phabricator.wikimedia.org/T95481#1299038 (10ArielGlenn) when you say instance, you mean the ec2 name there, right? or no? give me a sample name with all the pieces in it.
[17:22:17] 6Labs, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1299421 (10yuvipanda) So I guess this would need us to test by: # Upgrading kernel on one host and rebooting (and appropriate housekeeping for instances) # Bring back hosts # Suspend and resume a...
[19:09:56] labs_lvm, but just lint https://gerrit.wikimedia.org/r/#/c/211346/
[19:10:13] openstack, but just lint https://gerrit.wikimedia.org/r/#/c/211356/
[19:10:36] i'm doing these and others for https://phabricator.wikimedia.org/T93645
[19:11:18] one more for quarry: https://gerrit.wikimedia.org/r/#/c/211354/
[19:29:45] yuvipanda: were you the one that ran into an issue with salt --batch-size a little while back?
[19:29:59] apergos: no, not me
[19:30:00] someone in here was and I don't remember who (and there was no ticket of course)
[19:30:01] hrm
[19:30:06] salt didn't work at all for me :)
[19:30:24] well that leaves two other likely suspects, I'll ask 'em later
[19:32:57] !log tools disabling puppet on *all* hosts for https://gerrit.wikimedia.org/r/#/c/210000/
[19:33:04] Logged the message, Master
[19:35:26] bd808: maybe you reported an issue with batching and salt?
[19:35:48] also I saw your comments on the changeset, I just literally have not tried it at all so it could be completely wrog.
[19:35:51] wrong
[19:35:58] however cherry picking is cheap, feel free
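(A minimal sketch of the kind of batched salt run discussed above, executed on the salt master. The target globs, batch sizes and module functions here are illustrative assumptions, not the commands that were actually run.)

    # run a function across all minions, no more than 10 at a time
    # (equivalent short form: salt -b 10 '*' test.ping)
    salt --batch-size 10 '*' test.ping

    # the batch size can also be given as a percentage of the matched minions
    salt --batch-size 25% 'tools-*' cmd.run 'uptime'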
[19:54:16] !log tools enabled puppet on tools-precise-dev
[19:54:21] Logged the message, Master
[19:54:30] !log tools copy cleaned up hosts file to /etc/hosts on tools-precise-dev
[19:54:35] Logged the message, Master
[19:56:41] !log tools copy cleaned up and regenerated /etc/hosts from tools-precise-dev to all toollabs hosts
[19:56:45] Logged the message, Master
[20:01:19] !log tools tested new /etc/hosts on tools-bastion-01, puppet run produced no diffs, all good
[20:01:27] Logged the message, Master
[20:01:28] !log tools enabling puppet on all hosts
[20:01:32] Logged the message, Master
[20:05:32] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:05:44] PROBLEM - Puppet failure on tools-webgrid-generic-1402 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:05:48] uh oh
[20:06:46] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:07:09] 6Labs: Fix labs lvm to not run script every puppet run - https://phabricator.wikimedia.org/T99823#1299852 (10yuvipanda) 3NEW
[20:07:22] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL 22.22% of data above the critical threshold [0.0]
[20:07:24] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 33.33% of data above the critical threshold [0.0]
[20:07:34] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 30.00% of data above the critical threshold [0.0]
[20:07:35] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 22.22% of data above the critical threshold [0.0]
[20:07:48] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:07:59] hmm
[20:08:03] not sure where these are from
[20:08:35] PROBLEM - Puppet failure on tools-precise-dev is CRITICAL 50.00% of data above the critical threshold [0.0]
[20:08:49] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 40.00% of data above the critical threshold [0.0]
[20:08:53] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 40.00% of data above the critical threshold [0.0]
[20:08:55] ah
[20:09:09] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 22.22% of data above the critical threshold [0.0]
[20:09:10] !log tools transient shinken puppet alerts because I tried to force puppet runs on all tools hosts but cancelled
[20:09:15] Logged the message, Master
[20:09:18] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL 33.33% of data above the critical threshold [0.0]
[20:10:02] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 40.00% of data above the critical threshold [0.0]
[20:10:18] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 33.33% of data above the critical threshold [0.0]
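(A rough sketch, run from the salt master, of the disable / test / re-enable sequence the !log entries above describe. The target globs, file paths and copy mechanism are assumptions for illustration only.)

    # disable puppet fleet-wide before touching /etc/hosts handling
    salt 'tools-*' cmd.run 'puppet agent --disable "hosts cleanup, gerrit 210000"'

    # push the cleaned-up hosts file to one host and check for diffs there first
    salt-cp 'tools-precise-dev' /tmp/hosts.cleaned /etc/hosts
    salt 'tools-precise-dev' cmd.run 'puppet agent --enable && puppet agent --test'

    # once a test run shows no diffs, re-enable puppet everywhere
    salt 'tools-*' cmd.run 'puppet agent --enable'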
[20:13:13] 6Labs, 10Labs-Infrastructure, 3ToolLabs-Goals-Q4: Move LabsDB aliases to DNS - https://phabricator.wikimedia.org/T63897#1299896 (10scfc)
[20:14:54] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:17:27] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0]
[20:19:10] 6Labs, 10Labs-Infrastructure, 3ToolLabs-Goals-Q4: Move LabsDB aliases to DNS - https://phabricator.wikimedia.org/T63897#1299908 (10yuvipanda) We still need to move these to DNS. Need to set up a bunch of blocker tasks for that (moving to designate, split horizon, etc)
[20:19:11] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0]
[20:20:17] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0]
[20:20:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0]
[20:21:43] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0]
[20:23:49] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0]
[20:23:51] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0]
[20:24:13] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK Less than 1.00% above the threshold [0.0]
[20:27:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK Less than 1.00% above the threshold [0.0]
[20:29:58] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0]
[20:30:46] RECOVERY - Puppet failure on tools-webgrid-generic-1402 is OK Less than 1.00% above the threshold [0.0]
[20:33:36] RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0]
[20:37:26] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0]
[20:37:32] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0]
[20:37:50] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0]
[20:58:30] !log deployment-prep updated OCG to version ca4f64852de5b1de782b292b50038fbd2dd84266
[20:58:35] Logged the message, Master
[21:03:44] (03CR) 10Lucie Kaffee: "I'm reviewing it now. Would be nice to have some kind of documentation additionally." [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/202610 (owner: 10Ricordisamoa)
[21:24:45] (03CR) 10Lucie Kaffee: [C: 031] "I'd merge it like this, looks good to me." [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/202610 (owner: 10Ricordisamoa)
[21:29:31] (03CR) 10Ricordisamoa: "It should indeed be more documented. And maybe less hackish here and there." [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/202610 (owner: 10Ricordisamoa)
[21:29:56] (03CR) 10Ricordisamoa: [C: 032 V: 032] Initial commit [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/202610 (owner: 10Ricordisamoa)
[21:55:49] I read the mail that's currently linked to in the channel topic about the fingerprint changing.
[21:56:01] Does that apply to a Labs instance too?
[21:56:13] ssh just complained about a fingerprint change
[21:56:33] and the current ECDSA key fingerprint is different from the one in that email
[21:58:01] polybuildr: which instance?
[21:59:40] JohnLewis: spam-honeypot.eqiad.wmflabs
[21:59:51] I'm guessing that's the identifier you're looking for?
[22:01:11] polybuildr: yeah. the fingerprint of the instance should not have changed unless it was recently rebuilt
[22:01:35] rebuilt? not by me, at least.
[22:04:02] JohnLewis: Anything suspicious about that?
[22:04:39] polybuildr: the instance you listed doesn't exist apparently. Did you mean honeypot-wiki-alpha.eqiad.wmflabs?
[22:05:19] according to the instance page doesn't seem like anything changed that would change the fingerprint so, I'm unsure.
[22:05:39] JohnLewis: Ouch.
[22:06:07] Hold on a minute, I'm making a mistake.
[22:07:33] JohnLewis: Right, that was an old instance. :P I made a new one with the newer name and was trying to ssh to the old one for some reason.
[22:07:54] that's why then :)
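(For anyone hitting the same fingerprint warning: a quick way to check which key a host is actually presenting and to clear a stale known_hosts entry. The hostname below is just the one mentioned in the conversation; substitute the instance you are connecting to.)

    # fetch the host's current ECDSA key and print its fingerprint
    ssh-keyscan -t ecdsa spam-honeypot.eqiad.wmflabs > /tmp/instance.key
    ssh-keygen -lf /tmp/instance.key

    # if the cached entry really is stale, drop it from known_hosts
    ssh-keygen -R spam-honeypot.eqiad.wmflabs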