[00:05:38] 06Labs: Instances broken on initial provision with dns setup issues - https://phabricator.wikimedia.org/T126580#2521306 (10AlexMonk-WMF) 05Open>03Resolved Assuming this was fixed then. [00:11:57] RECOVERY - Puppet run on tools-worker-1005 is OK: OK: Less than 1.00% above the threshold [0.0] [00:12:01] 06Labs, 10Horizon: DNS Domains view in Horizon for Tools project displays only one domain - https://phabricator.wikimedia.org/T131334#2521325 (10AlexMonk-WMF) 05Open>03Invalid tools-login.wmflabs.org is a record under wmflabs.org, not a domain of it's own, so is owned by the wmflabsdotorg project. If tools... [00:12:27] RECOVERY - Puppet run on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0] [00:18:53] !log bastion added Krenair as admin to help with T132225 and other issues [00:18:54] T132225: Add SSHFP dns records to bastions - https://phabricator.wikimedia.org/T132225 [00:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Bastion/SAL, Master [00:19:08] !log tools added Krenair as admin to help with T132225 and other issues. [00:19:09] T132225: Add SSHFP dns records to bastions - https://phabricator.wikimedia.org/T132225 [00:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:21:37] 06Labs, 10Tool-Labs, 13Patch-For-Review: Convert most top level tool and bastion dns redcords to CNAMEs - https://phabricator.wikimedia.org/T131796#2521337 (10AlexMonk-WMF) @Andrew, @yuvipanda: I suggest that, instead of doing this, we briefly remove the records, create domains for each, recreate the records... [00:25:24] 06Labs, 10Tool-Labs, 13Patch-For-Review: Convert most top level tool and bastion dns redcords to CNAMEs - https://phabricator.wikimedia.org/T131796#2521342 (10AlexMonk-WMF) Proposed new domains and owning project: * huggle-rc.wmflabs.org - huggle * tools-checker.wmflabs.org - tools * tools-dev.wmflabs.org -... [00:51:25] 06Labs, 10Tool-Labs, 13Patch-For-Review: Convert most top level tool and bastion dns records to CNAMEs - https://phabricator.wikimedia.org/T131796#2521398 (10tom29739) [00:52:40] 06Labs, 10Tool-Labs, 13Patch-For-Review: Convert most top level tool and bastion dns records to CNAMEs - https://phabricator.wikimedia.org/T131796#2178815 (10tom29739) [00:52:42] 06Labs, 10Tool-Labs: Add SSHFP dns records to bastions - https://phabricator.wikimedia.org/T132225#2521400 (10AlexMonk-WMF) I just tried this with primary.bastion.wmflabs.org, but Horizon won't let me create algorithm 3 (ECDSA) or algorithm 4 (ED25519) fingerprint records, making SSH *really* hate logging in.... [01:08:13] 06Labs, 10Tool-Labs: Add SSHFP dns records to bastions - https://phabricator.wikimedia.org/T132225#2521426 (10AlexMonk-WMF) Fix submitted upstream for validation of ED25519 records: https://review.openstack.org/350847 [01:23:17] 06Labs, 10Tool-Labs, 07Upstream: Add SSHFP dns records to bastions - https://phabricator.wikimedia.org/T132225#2521447 (10AlexMonk-WMF) 05Open>03stalled Fix submitted upstream to designate for validation of ECDSA, ED25519 and SHA-256 records: https://review.openstack.org/350850 Marking this as stalled, I... [01:26:02] 06Labs, 10Tool-Labs, 07Upstream: Add SSHFP dns records to bastions - https://phabricator.wikimedia.org/T132225#2521451 (10AlexMonk-WMF) (Removing the primary.bastion.wmflabs.org SSHFP records until then) [02:11:51] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 07Tracking: Log files on labs instance fill up disk (/var is only 2GB) (tracking) - https://phabricator.wikimedia.org/T71601#2521463 (10Tgr) [02:48:03] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2521469 (10yuvipanda) [03:23:00] harej I rebooted the librarybase instance, should be back up soon [03:23:15] yuvipanda: so what exactly happened? and how do we prevent it from happening again? [03:25:29] harej we don't really know yet. we're collecting information in that ticket to try to spot a cause so we can fix it [03:25:38] which ticket? [03:25:39] (https://phabricator.wikimedia.org/T141673#2521469 ticket) [03:26:15] 06Labs, 10Labs-project-Librarybase: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2521481 (10Harej) [03:27:03] harej since it is for across all labs projects, I'd prefer to keep per-project tags off it - since otherwise it'd have accumulated about 5-6 tags already. do you mind if I remove that? [03:27:10] Sure [03:27:26] 06Labs, 10Labs-Infrastructure: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2521483 (10yuvipanda) [03:27:28] thanks [03:35:28] PROBLEM - Host tools-docker-builder-03 is DOWN: CRITICAL - Host Unreachable (10.68.19.7) [05:12:50] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Bowleerin was created, changed by Bowleerin link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Bowleerin edit summary: Created page with "{{Tools Access Request |Justification=I would like to create tools to help editors and make my bot work more easy. |Completed=false |User Name=Bowleerin }}" [05:28:26] hi, is Krinkle there or anybody, who knows .js [05:34:23] Thibaut120094 , have you knowledge in js [05:34:37] no [05:34:47] psychoslave, you? [05:35:03] hi doctaxon [05:35:22] what is your question? [05:43:04] psychoslave: to get a new button with link in p-personal I have this: [05:43:08] mw.util.addPortletLink( "p-personal",server + "/wiki/" + "Special:Contributions/TaxonBot","TaxonBot" ); [05:48:30] psychoslave: and my question is, how do I get a second a link behind the first link in the same button? [06:41:07] 06Labs, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2521553 (10ema) @madhuvishy: your PGP key does not seem to be signed yet: https://wikitech.wikimedia.org/wiki/PGP_Keys#Signing_keys. Ping me if you want... [07:43:08] (03PS1) 10Alexandros Kosiaris: Add various missing secrets [labs/private] - 10https://gerrit.wikimedia.org/r/302883 [07:44:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add various missing secrets [labs/private] - 10https://gerrit.wikimedia.org/r/302883 (owner: 10Alexandros Kosiaris) [08:18:13] (03PS1) 10Lokal Profil: Replace TestFillTableMonumentsBase by CustomAssertions [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/302887 [10:47:10] PROBLEM - Free space - all mounts on tools-docker-registry-01 is CRITICAL: CRITICAL: tools.tools-docker-registry-01.diskspace.root.byte_percentfree (<44.44%) [11:24:25] (03PS1) 10Alexandros Kosiaris: Move labtest hiera into host specific configs [labs/private] - 10https://gerrit.wikimedia.org/r/302905 [11:25:09] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Move labtest hiera into host specific configs [labs/private] - 10https://gerrit.wikimedia.org/r/302905 (owner: 10Alexandros Kosiaris) [13:02:46] [1;3C[1;3C1 [13:29:02] (03CR) 10Paladox: "I'm going to apply this patch to the bot now." [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302914 (owner: 10Paladox) [13:35:11] !log tools.lolrrit-wm i ran npm install and npm update [13:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [13:39:03] (03Abandoned) 10Paladox: Use message.patchSet.uploader.name [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302914 (owner: 10Paladox) [13:39:57] (03PS1) 10Paladox: Use message.uploader.name instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302916 [13:41:04] (03CR) 10Paladox: [C: 031] "This will only work when we upgrade to gerrit 2.12.4 we should still merge this anyways." [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302916 (owner: 10Paladox) [13:47:19] 10PAWS: Paws display 502 - Bad gateway error - https://phabricator.wikimedia.org/T140578#2522183 (10Ivanhercaz) Recently no. Only one time happens that my two terminals shutdown and I had to restart it and re-configure de user-config.py [13:54:59] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Heikki-hakala was created, changed by Heikki-hakala link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Heikki-hakala edit summary: Created page with "{{Tools Access Request |Justification=Development of ElasticSearch and R analytic tools for personal use |Completed=false |User Name=Heikki-hakala }}" [13:59:08] (03Draft2) 10Paladox: Update some packages [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302918 [13:59:25] (03Draft1) 10Paladox: Update some packages [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302918 [14:00:07] (03CR) 10Paladox: "I'm going to cherry pick this to test this on the bot." [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302918 (owner: 10Paladox) [14:03:40] !log tools.lolrrit-wm cherry picking https://gerrit.wikimedia.org/r/#/c/302918/ to test [14:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [14:14:40] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [14:14:40] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [14:17:00] PROBLEM - Puppet run on tools-worker-1011 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [14:17:18] PROBLEM - Puppet run on tools-flannel-etcd-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [14:17:25] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [14:17:29] PROBLEM - Puppet run on tools-worker-1017 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:17:33] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [14:17:47] PROBLEM - Puppet run on tools-worker-1012 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [14:17:51] PROBLEM - Puppet run on tools-elastic-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [14:17:55] PROBLEM - Puppet run on tools-worker-1006 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [14:18:06] Krenair yuvipanda ^^ it seems puppet is failing [14:18:11] PROBLEM - Puppet run on tools-worker-1016 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [14:18:35] PROBLEM - Puppet run on tools-worker-1009 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [14:19:01] PROBLEM - Puppet run on tools-worker-1020 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [14:19:16] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [14:19:44] PROBLEM - Puppet run on tools-worker-1018 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [14:22:04] PROBLEM - Puppet run on tools-elastic-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [14:22:58] PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [14:23:40] PROBLEM - Puppet run on tools-docker-registry-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:23:42] PROBLEM - Puppet run on tools-worker-1021 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [14:25:48] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:26:45] (03PS1) 10Elukey: Renamed the analytics deploy keyholder ssh keypair files [labs/private] - 10https://gerrit.wikimedia.org/r/302925 [14:27:09] (03CR) 10Elukey: [C: 032 V: 032] Renamed the analytics deploy keyholder ssh keypair files [labs/private] - 10https://gerrit.wikimedia.org/r/302925 (owner: 10Elukey) [14:27:50] PROBLEM - Puppet run on tools-proxy-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [14:27:57] PROBLEM - Puppet run on tools-worker-1005 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [14:28:25] PROBLEM - Puppet run on tools-worker-1022 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [14:30:14] PROBLEM - Puppet run on tools-worker-1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [14:30:14] PROBLEM - Puppet run on tools-worker-1019 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:30:28] PROBLEM - Puppet run on tools-worker-1014 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:31:14] PROBLEM - Puppet run on tools-worker-1015 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [14:32:28] PROBLEM - Puppet run on tools-flannel-etcd-03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:32:39] PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:33:37] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [14:33:53] PROBLEM - Puppet run on tools-elastic-03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:37:32] PROBLEM - Puppet run on tools-worker-1023 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:38:26] PROBLEM - Puppet run on tools-worker-1007 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:39:28] PROBLEM - Puppet run on tools-prometheus-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:40:14] PROBLEM - Puppet run on tools-flannel-etcd-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [14:40:36] PROBLEM - Puppet run on tools-worker-1013 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [14:40:44] PROBLEM - Puppet run on tools-worker-1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [14:41:44] PROBLEM - Puppet run on tools-proxy-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [14:42:30] PROBLEM - Puppet run on tools-worker-1002 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [14:50:45] PROBLEM - Puppet run on tools-puppetmaster-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:57:30] hIIiii v confusing issue in deployment-prep [14:57:33] i just made a new instance [14:57:39] everything is working fine [14:58:12] yuvipanda: looks like everything on teh tools master is failing? [14:58:28] actually, let me try 2 more things before complaining... [14:59:25] ok, yeah. so [14:59:27] i have a new host [14:59:42] it can talk to some instances in deployment prep [14:59:47] but, the one it needs to talk to, it can't. [14:59:51] not on any port [14:59:59] ping works, but but udp and tcp can't get through [15:00:07] there's nothing special about this host, other than that it is new [15:00:10] ottomata: yesterday iirc we discovered a delay in security group setup [15:00:14] oh? [15:00:22] give it a few minutes may be artifiact of liberty upgrade [15:00:33] it was tail end of my day so I'm not sure what the deal was [15:01:14] There was a patch merged that broke tools, and likely also affects other labs hosts. Has bren fixed/reverted ( see -operations) [15:01:28] ah [15:01:50] ok will wait a few [15:05:11] chasemp: not sure if this is relevant, but i also can't initiate a connection from this one already existing box, to the new box [15:05:14] RECOVERY - Puppet run on tools-worker-1019 is OK: OK: Less than 1.00% above the threshold [0.0] [15:05:30] but i can from other boxes [15:05:44] RECOVERY - Puppet run on tools-puppetmaster-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:06:12] RECOVERY - Puppet run on tools-worker-1015 is OK: OK: Less than 1.00% above the threshold [0.0] [15:06:15] ottomata: I'm not sure I grok that, can you put this in a task? maybe an issue is persisting and we'll let a bit of dust settle here but I'm sure andrew b. will want details [15:06:43] RECOVERY - Puppet run on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:07:51] RECOVERY - Puppet run on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:08:19] chasemp: ok :/ rephrasing too [15:08:25] RECOVERY - Puppet run on tools-worker-1022 is OK: OK: Less than 1.00% above the threshold [0.0] [15:08:26] boxA existed for a long time [15:08:29] new boxB [15:08:38] boxB cannot talk to boxA [15:08:42] butA cannot talk to boxB [15:08:51] but boxA can talk to boxes C-Z (all exsiting for a long time) [15:08:56] and boxB can talk to boxes C-Z [15:09:06] just no boxA<->boxB connection [15:09:08] ah that is more specific and odd [15:10:14] RECOVERY - Puppet run on tools-worker-1004 is OK: OK: Less than 1.00% above the threshold [0.0] [15:10:26] RECOVERY - Puppet run on tools-worker-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [15:12:26] RECOVERY - Puppet run on tools-flannel-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0] [15:12:36] RECOVERY - Puppet run on tools-worker-1003 is OK: OK: Less than 1.00% above the threshold [0.0] [15:12:47] chasemp Ottomata that was exactly what I was seeing yesterday [15:13:00] yuvipanda: what resolved it? [15:13:10] I waited 10-15min [15:13:22] Up to 20 in some cases [15:13:26] RECOVERY - Puppet run on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0] [15:13:36] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0] [15:13:52] RECOVERY - Puppet run on tools-elastic-03 is OK: OK: Less than 1.00% above the threshold [0.0] [15:14:33] some initial setup is being dropped and some eventual consistency is way late maybe [15:14:38] definitely need to get this on a task tho [15:15:14] RECOVERY - Puppet run on tools-flannel-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:15:35] RECOVERY - Puppet run on tools-worker-1013 is OK: OK: Less than 1.00% above the threshold [0.0] [15:16:03] Yeah [15:16:05] Hi it seems that qa-morebots [15:16:10] hasent joined -releng [15:16:16] meaning that channel has no log bot [15:16:18] (am going to go back afk shortly) [15:16:22] Could someone have a look please? [15:16:26] Ottomata can you file a task? [15:16:36] yuvipanda: ja can [15:16:37] yuvipanda i found the problem with grrrit-wm too [15:16:38] gimme a few [15:17:29] RECOVERY - Puppet run on tools-worker-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [15:17:30] yuvipanda could you merge https://gerrit.wikimedia.org/r/#/c/302916/ so i can update grrrit-wm please? [15:17:33] RECOVERY - Puppet run on tools-worker-1023 is OK: OK: Less than 1.00% above the threshold [0.0] [15:17:45] It was a bug in gerrit, and is fixed in gerrit 2.12.4 [15:17:57] Ah nice, paladox [15:18:02] Yep [15:18:04] Did you cherry pick that change? [15:18:07] Yep [15:18:40] thanks ottomata [15:18:58] yuvipanda It will work as it usually does but wont have any affects telling you the correct author of the patch until we update to gerrit 2.12.4 [15:19:29] RECOVERY - Puppet run on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:19:40] IE if someone updates your patch from inline gerrit edit it wont tell you that you did it, it will say the original author did [15:19:41] it [15:20:02] but the patch was merged upstream and i confirmed it fixed it by looking at gerrit stream-events :) [15:20:31] Cool! [15:20:43] yuvipanda could you have a look at why qa-morebots hasent joined -releng please if you have time [15:20:45] :) [15:20:45] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [15:21:26] Paradox nope - you have to ask one of the maintainers of the bot. You can find the list on tools.wmflabs.org/?list [15:22:19] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:22:27] RECOVERY - Puppet run on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0] [15:22:33] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:22:49] RECOVERY - Puppet run on tools-elastic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:23:11] RECOVERY - Puppet run on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [15:23:38] RECOVERY - Puppet run on tools-worker-1009 is OK: OK: Less than 1.00% above the threshold [0.0] [15:24:16] RECOVERY - Puppet run on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:24:42] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:24:42] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:24:44] RECOVERY - Puppet run on tools-worker-1018 is OK: OK: Less than 1.00% above the threshold [0.0] [15:26:22] ok [15:27:00] RECOVERY - Puppet run on tools-worker-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [15:27:26] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0] [15:27:46] RECOVERY - Puppet run on tools-worker-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [15:27:54] RECOVERY - Puppet run on tools-worker-1006 is OK: OK: Less than 1.00% above the threshold [0.0] [15:28:40] RECOVERY - Puppet run on tools-worker-1021 is OK: OK: Less than 1.00% above the threshold [0.0] [15:29:02] RECOVERY - Puppet run on tools-worker-1020 is OK: OK: Less than 1.00% above the threshold [0.0] [15:30:30] Ok [15:30:46] RECOVERY - Puppet run on tools-worker-1010 is OK: OK: Less than 1.00% above the threshold [0.0] [15:32:03] RECOVERY - Puppet run on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:32:55] RECOVERY - Puppet run on tools-worker-1005 is OK: OK: Less than 1.00% above the threshold [0.0] [15:32:57] RECOVERY - Puppet run on tools-k8s-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0] [15:33:39] RECOVERY - Puppet run on tools-docker-registry-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:51:51] 06Labs, 10Tool-Labs: Get all my tools - https://phabricator.wikimedia.org/T142099#2522626 (10Magnus) [16:22:15] 06Labs, 10Tool-Labs: Get all my tools - https://phabricator.wikimedia.org/T142099#2522626 (10AlexMonk-WMF) I counted 63 with your name listed under maintainers. Authors is a different thing, and only stored in /data/project/$tool/public_html/toolinfo.json [16:34:22] Krenair, hi im wondering if you could take a look at qa-morebots please? [16:34:30] Since it dosent seem to have joined -releng channel [17:17:27] moar OpenStack https://joinup.ec.europa.eu/community/osor/news/france-pilots-open-source-based-cloud-services [17:25:23] 06Labs, 10Tool-Labs: Get all my tools - https://phabricator.wikimedia.org/T142099#2523057 (10Magnus) 05Open>03Resolved a:03Magnus Ah, I found a whole batch of bogus me in Notes for "magnustools" on https://tools.wmflabs.org/?list OK, I guess 63 it is then. [17:26:30] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/ProgrammingGeek was created, changed by ProgrammingGeek link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/ProgrammingGeek edit summary: Created page with "{{Tools Access Request |Justification=Creating bots for use on the English Wikipedia. |Completed=false |User Name=ProgrammingGeek }}" [17:34:36] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:40:31] yuvipanda: Reminder to change the tokens on Quarry. [17:41:20] yesterday labsdb1001 crashed; we still have some unstabilities that I am trying to fix now [17:41:38] Niharika yup, will do! [17:48:25] ores-web-05 just went crazy and I can't log into it. [17:48:39] It stopped serving requests. [17:48:44] And sshing gave me "ssh_exchange_identification: Connection closed by remote host [17:48:45] " [17:49:00] Now it's not accepting my rsa key [17:49:17] chasemp, ^ anything going on that might explain this? [17:49:28] It seems to be serving requests again, but I still can't log in. [17:50:29] andrewbogott, ^ sorry if I'm overpinging. [17:50:57] halfak: let me get to a stopping point and then I'll see what I can figure out [17:51:02] kk thanks [17:51:13] * halfak will keep taking notes [17:51:14] ssh now gives "Timeout, server ores-web-05.eqiad.wmflabs not responding." [17:52:00] Horizon shows "Power state" as "Running". I logged in as fast as I could and never saw any other power state. [17:53:25] Here's a "-vv" of the ssh call: http://pastebin.ca/3673660 [17:54:48] Looks like we are getting OOM process killing [17:55:02] We should probably reduce the # of processes on the uwsgi nodes. I'll do that now. [17:56:09] !log ores reduced web workers per core from 28 to 16 [17:56:10] halfak: what project is this in? [17:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [17:56:16] ores [17:56:56] I don't see any instances with -web- in their names [17:57:00] only -worker- [17:57:23] ores-web-03, ores-web-04, ores-web-05 are all there. [17:57:33] maybe I need to add myself [17:58:13] but anyway… did you already diagnose/fix the problem? [17:58:53] ah, of course, I had 'worker' in my instance filter and that filter persists even when you change projects and views [17:59:09] I don't know what the problem is, no. [17:59:20] The machine is still unavailable to me, but it seems to be serving requests. [17:59:34] It seems that something kicked it. [17:59:34] it's definitely oom, is there reason to think there are any problems besides that? [17:59:44] (you looked at the log already, right?) [17:59:48] Yeah. [17:59:55] But it still won't take my ssh key [17:59:59] Which is surprising and new [18:00:11] ores-web-03 works fine [18:00:42] The oom-killer isn't exactly thoughtful about what it kills [18:00:45] probably killed sshd [18:00:59] Once a system is oom there's no such thing as 'surprising' :) [18:01:10] lol k [18:01:14] So, what do? [18:01:37] reboot it, and then change the behavior so it uses less ram in the future [18:01:43] kk doing. [18:02:44] !log ores rebooting ores-web-05 to address OOM [18:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [18:05:29] Hmm, horizon reports I have 5/7 floating IPs allocated right now. I don't have any instances at the moment though and I don't see where they're used. [18:05:34] (staging project) [18:05:50] ostriches: Hm… probably my fault [18:07:42] ostriches: this is going to be quite a tangle to sort out, do you mind making me a ticket? [18:07:48] (I assume you don't need >2 fips right now) [18:08:05] Yeah 2 is plenty [18:08:09] I freed a bunch of unused IPs a few weeks ago and must have done it some way that didn't get noticed by the quota system [18:08:24] yay! I can log into ores-web-05! [18:08:25] so I'll have to read some source code to figure out why that's still recorded someplace :/ [18:09:34] !log tools.lolrrit-wm Restarted grrrit-wm pod [18:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [18:10:04] andrewbogott, I just checked in on ores-web-05 and started a puppet run. It failed in a way I've never seen before. Any insights? http://pastebin.ca/3673667 [18:10:47] Also I can't create an instance :( [18:11:33] ostriches: could you be more specific? [18:11:59] "The requested instance cannot be launched as you only have 0 of your quota available." [18:12:07] (in horizon) [18:13:48] halfak: did you mess with /etc/resolv.conf by change? It is very wrong [18:14:01] andrewbogott, no I have not [18:14:10] * halfak isn't sure what that file does [18:14:33] ok, I'm going to reboot again [18:14:36] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0] [18:14:40] oh, wait, no I'm not... [18:14:42] hm [18:16:32] halfak: no idea how it got in that state but should be fixed now [18:16:43] great. Shall I try puppet again? [18:18:09] sure [18:18:34] OK looks good [18:18:39] Our memory pressure should be better too [18:43:41] lag is going back to 0, hopefully no more issues [18:43:52] happen between today and tomorrow [18:55:34] andrewbogott: thanks man, I'm back anything I can help you with? [18:56:26] chasemp: I'm looking at ostriches's problem with the 'staging' project. There's something curious happening with misc-web and the console proxy, that might be worth a look. [18:56:37] hang on, lemme get you a url [18:56:53] My spare IPs went away :) [18:57:21] ostriches: yeah, weirdly resetting your quotas didn't help [18:57:28] I think that error message must be incorrect [18:57:31] chasemp: try https://labtestspice.wikimedia.org/spice_auto.html?token=cbd740ea-181c-4d87-afc0-5fa0327465ff [18:57:38] and see if you can figure out where the disagreement [18:57:42] SecurityError: The operation is insecure. [18:57:45] andrewbogott am going to create a new instance now, let's see how that goes :) [18:57:54] chasemp: is that with https-everywhere? [18:58:00] andrewbogott: nope [18:58:07] that's slightly different from the error I got, but it doesn't surprise me that much... [18:58:25] that's ff [18:58:38] Mixed Content: The page at 'https://labtestspice.wikimedia.org/spice_auto.html?token=cbd740ea-181c-4d87-afc0-5fa0327465ff' was loaded over HTTPS, but attempted to connect to the insecure WebSocket endpoint 'ws://labtestspice.wikimedia.org:443/'. This request has been blocked; this endpoint must be available over WSS. [18:58:41] chome says [18:58:41] SecurityError: Failed to construct 'WebSocket': An insecure WebSocket connection may not be initiated from a page loaded over HTTPS. [18:58:42] so something is trying ot reach out via http [18:58:44] I think the console proxy doesn't like that the url has https but misc-web is asking it for unencrypted stuff [18:58:46] yeah it's a variation on a mixed content warning [18:58:49] wss vs ws [18:58:51] yep [18:58:54] it's screwing up generation of the ws url [18:58:56] it uses port 443 [18:59:18] 4 ppl same conclusion in 2 minutes :) [19:00:04] it's the phase of the moon, actually [19:00:04] * yuvipanda tries to be fair and balanced [19:00:35] andrewbogott: I thought we were going to hide hte link within horizon and it would be embedded there? not the case I guess [19:00:46] I had a half banked idea about how the console proxy worked I imagine [19:01:19] chasemp: it will be embedded in horizon eventually [19:01:19] I'm just trying to give you a simpler test case [19:01:25] gotcha [19:01:46] (really, horizon doesn't do anything smart, it just generates a url like the above, and then redirects you do it) [19:01:51] *to [19:02:32] do we need it to be available over the web? is just having console access from labcontrol not enough? [19:02:36] or is this the official openstack way? [19:03:12] I don't know that there is a strictly cli console access mechanism native to openstack, kvm yes but wrapped up in openstack idk [19:03:20] but I would be totally good w/ it if there were [19:03:23] even prefer [19:04:16] if it's restricted to labs ops you can just require people proxy through the cluster right? [19:04:46] no need for misc-web [19:06:20] PROBLEM - Host tools-worker-1008 is DOWN: CRITICAL - Host Unreachable (10.68.19.246) [19:06:31] ^yuvi? [19:06:35] yeah that's me [19:06:42] the instance was dOA so killed it [19:06:54] (DNS cache for the name hadn't expired from yesterday) [19:06:59] but I didn't realize it was responding to ping [19:07:00] so is new instance creation working? [19:07:15] yeah, this is an unrelated known problem when you recreate instances with the same name [19:07:23] k [19:08:26] hm, ostriches, the quota manager feels very strongly that you already have 719 instances in that project :( [19:08:35] ... [19:08:40] I promise we never had nearly that many [19:09:02] !log tools cleaned up nginx log files in tools-docker-registry-01 to fix free space warning [19:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:09:08] That exact quote though just cracked me up :p [19:09:22] That quota manager, back at it with the strong feelings :p [19:09:27] this is suspiciously, exactly the total number of instances in labs [19:09:36] 06Labs, 10Tool-Labs: Puppet disabled on tools-worker-10* - https://phabricator.wikimedia.org/T141719#2523558 (10yuvipanda) 05Open>03Resolved All the nodes with puppet disabled have been deleted and recreated! [19:10:21] 10PAWS: Paws display 502 - Bad gateway error - https://phabricator.wikimedia.org/T140578#2523561 (10yuvipanda) 05Open>03Resolved Ok, I suspect it was related to T141017, closing this for now then. Re-open if it happens again! [19:10:42] ostriches: can you try to create again? I want to make sure these silly numbers are affected by user account [19:10:50] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126#2523565 (10yuvipanda) 05Open>03Resolved I'm going to call this done because there are no more worker nodes left with the loopback configuration. [19:12:58] legoktm wanna move contentcontributor over to k8s? ) [19:13:47] same [19:14:01] andrewbogott: Yep same error [19:15:18] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2523580 (10yuvipanda) [19:15:29] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484363 (10yuvipanda) I've now moved all of the worker nodes to use direct-lvm, let's see if the hangs recur. [19:17:11] RECOVERY - Free space - all mounts on tools-docker-registry-01 is OK: OK: All targets OK [19:18:00] yuvipanda: are all workers ported to the new storage scheme? [19:18:28] chasemp yup. [19:18:51] so we are waiting to see if that helps [19:19:02] yeah. [19:19:04] and etcd nodes we haven't seen one in a week+? [19:19:07] I have a grafana dashboard now [19:19:09] nope [19:19:11] but we have a spattering of jessie VMs elsewhere that look similar [19:19:17] yeah [19:19:56] https://grafana-labs-admin.wikimedia.org/dashboard/db/kubernetes-worker-health [19:21:59] yuvipanda could you merge https://gerrit.wikimedia.org/r/#/c/302916/ please? [19:22:03] rest of my week is probably going to be around graphite/grafana [19:22:05] It is applied already [19:22:21] But just doint want it to be overwritten. [19:22:51] (03CR) 10Yuvipanda: [C: 032] Use message.uploader.name instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302916 (owner: 10Paladox) [19:22:55] yuvipanda i managed to get graphite/grafana both those working in labs. [19:22:56] done paladox [19:22:58] thanks [19:23:01] yw [19:23:04] :) [19:23:21] (03Merged) 10jenkins-bot: Use message.uploader.name instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302916 (owner: 10Paladox) [19:25:49] !log tools.lolrrit-wm git reset --hard origin/master && git pull origin master [19:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [19:26:23] thanks for tracking it through, paladox! [19:26:27] Your welcome [19:27:15] :) [19:27:41] !log tools.lolrrit-wm restarting grrrit-wm bot [19:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [19:35:58] yuvipanda does this https://github.com/mscdex/ssh2/commit/c1ec523ce00649fa6a5ae069529b94295babc869 look like it will reconnect to ssh [19:36:17] So that grrrit-wm will be able to reconnect without needing a manual restart [19:40:06] no idea, paladox [19:40:12] Oh [19:40:13] ok [19:40:20] do you want to try it? [19:41:04] I tryed it [19:41:07] if you run 'webservice --backend=kubernetes nodejs shell' you'll get a bash shell with the correct nodejs version installed, and so you can run npm install there after updating package.json [19:41:10] I don't really have much time to help tho [19:41:11] i updated the packages but it broke [19:41:14] Ok [19:41:28] right. so revert back, I guess. [19:41:33] Yep i did [19:41:38]