[01:09:31] !log tools killing local copy of python-requests, there seems to be a newer version in prod [01:09:38] Logged the message, Master [01:50:55] RECOVERY - Puppet failure on tools-exec-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [01:52:25] RECOVERY - Puppet failure on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [01:58:41] RECOVERY - Puppet failure on tools-exec-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [02:01:19] RECOVERY - Puppet failure on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [02:01:53] RECOVERY - Puppet failure on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0] [02:03:56] YESSSS [02:03:57] RECOVERY - Puppet failure on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [02:06:52] RECOVERY - Puppet failure on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0] [02:09:58] RECOVERY - Puppet failure on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0] [02:10:52] RECOVERY - Puppet failure on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0] [02:14:10] !log tools created tools-exec-14{01-05} [02:14:36] Logged the message, Master [02:38:21] YuviKTM: why do you have so many nicks? [02:38:49] Negative24: 'I have made a vow to Yahweh and cannot break it.' [02:39:18] ? [02:39:35] :) long story [02:40:42] I can see a bit of an explanation in -dev. Hi legopanda [02:40:48] :D [02:44:21] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [02:44:40] we know, shinken-wm. [02:44:44] it’ll be alright, don’t worry [02:46:43] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [0.0] [02:46:47] it's always nice to know that it has your back (about 4-5 times in redundancy) [02:46:57] :) [02:46:58] yeah [02:51:27] PROBLEM - Puppet failure on tools-exec-1403 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [02:53:19] !log tools created tools-exec-14{06-10} [02:54:01] Logged the message, Master [02:54:21] RECOVERY - Puppet failure on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [02:55:54] thank you, tools-exec-1401! [02:56:45] RECOVERY - Puppet failure on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [03:01:00] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:04:07] !log tools pooled tools-exec-1401 [03:04:12] Logged the message, Master [03:06:50] PROBLEM - Puppet failure on tools-exec-1404 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [03:07:46] !log tools pooled tools-exec-1405 [03:07:51] Logged the message, Master [03:09:46] dungodung: I was wondering if you had a minute to look into my bot’s cloak request, please. :) [03:13:56] !log tools pooled tools-exec-1402 [03:14:02] Logged the message, Master [03:16:28] RECOVERY - Puppet failure on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [03:16:50] RECOVERY - Puppet failure on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0] [03:18:23] !log tools pooled tools-exec-1403, 1404 [03:18:28] Logged the message, Master [03:21:34] !log disabled and drained continuous tasks off tools-exec-20 [03:21:38] disabled is not a valid project.
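The rejection just above happens because the logging bot treats the first word after !log as the project name, so a message that omits it, like "!log disabled and drained ...", is refused. Going by the entries the bot accepts elsewhere in this log, the expected shape is roughly:

    !log <project> <free-form message>
    !log tools disabled and drained continuous tasks off tools-exec-20

The corrected retry follows below.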
[03:24:17] !log tools disabled and drained continuous tasks off tools-exec-20 to tools-exec-24 [03:24:22] Logged the message, Master [03:25:03] I love those overlooked "fill in your details here" files [03:25:33] * Negative24 facepalms *hard* [03:26:03] RECOVERY - Puppet failure on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [03:26:41] PROBLEM - Host tools-exec-21 is DOWN: CRITICAL - Host Unreachable (10.68.17.252) [03:27:29] PROBLEM - Host tools-exec-22 is DOWN: CRITICAL - Host Unreachable (10.68.17.253) [03:27:32] !log tools deleted tools-exec-21 to 24, one task still running on tools-exec [03:28:04] Logged the message, Master [03:28:16] PROBLEM - Host tools-exec-23 is DOWN: CRITICAL - Host Unreachable (10.68.17.254) [03:29:18] years later when people look at the tools logs for today, they'll be like "who in the world was YuviKTM"??? [03:29:25] haha [03:29:26] :D [03:29:42] PROBLEM - Host tools-exec-24 is DOWN: CRITICAL - Host Unreachable (10.68.17.255) [03:30:28] !log phabricator git cloning on ssh configured and working [03:30:32] Logged the message, Master [03:31:31] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0] [03:31:34] !log depooled and deleted tools-exec-12 had nothing on it [03:31:34] depooled is not a valid project. [03:31:38] !log tools depooled and deleted tools-exec-12 had nothing on it [03:31:43] Logged the message, Master [03:32:18] legoPanda: wikibugs got restarted, wonder if it’s still working [03:32:45] PROBLEM - Puppet failure on tools-exec-1406 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:32:53] PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:33:01] YuviKTM: uhh, no idea! [03:33:02] PROBLEM - Host tools-exec-12 is DOWN: CRITICAL - Host Unreachable (10.68.17.166) [03:33:09] YuviKTM: it should have rejoined here upon restart [03:33:54] it did but I just made a comment and it didn’t say anything [03:33:54] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1248181 (10yuvipanda) Alright, so I've created tools-exec-12{01-10} and tools-exec-14{01-10}. I've also pooled in tools-exec-14{01-05} and depooled almost all the old trusty nodes (except tools-exec-20, which has... [03:33:55] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:33:55] aha!
[03:33:55] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:33:55] it’s here \o/ [03:41:30] RECOVERY - Puppet failure on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [03:52:57] !log tools depooled tools-exec-03 / 04 [03:53:03] Logged the message, Master [03:54:30] !log tools tools-exec-03 and -04 were deleted a long time ago [03:54:35] Logged the message, Master [03:57:41] RECOVERY - Puppet failure on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [03:57:51] RECOVERY - Puppet failure on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [03:58:14] !log tools pooled tools-exec-12{02-10}, forgot to put appropriate roles on 1201, fixing now [03:58:19] Logged the message, Master [04:00:09] !log tools pooled tools-exec-1406 and 1407 [04:00:14] Logged the message, Master [04:03:42] RECOVERY - Puppet failure on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [04:04:13] !log tools pooled tools-exec-1408 and tools-exec-1409 [04:04:18] Logged the message, Master [04:08:08] !log tools depooled tools-exec-09, apt troubles [04:08:13] Logged the message, Master [04:14:33] !log tools repooled tools-exec-09, apt troubles fixed [04:14:38] Logged the message, Master [04:19:40] !log tools rejuggle jobs again in trustyland [04:19:45] Logged the message, Master [04:23:43] !log tools repooled tools-exec-1201, it's all good now [04:23:48] Logged the message, Master [04:25:01] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [04:25:42] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1248203 (10yuvipanda) Ok, so tools-exec-14{01-10} are pooled now, and so are tools-exec-12{01-10} :D All old trusty instances except tools-exec-20 are deleted as well. [04:27:40] !log tools depooled tools-exec-09.eqiad.wmflabs [04:27:45] Logged the message, Master [04:28:45] RECOVERY - Puppet failure on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [04:28:49] !log tools deleted tools-exec-09 [04:28:53] Logged the message, Master [04:31:10] !log tools depooled exec-{01-05}, rejigged jobs to newer nodes [04:31:37] I think I’m going to need more nodes. [04:32:20] PROBLEM - Host tools-exec-09 is DOWN: CRITICAL - Host Unreachable (10.68.17.64) [04:33:52] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1248204 (10yuvipanda) So everything in tools-exec-{01-10} has been disabled and drained of continuous jobs. [04:35:02] RECOVERY - Puppet failure on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [04:35:11] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1248205 (10yuvipanda) We're going to need more nodes, I think. I'm going to add 10 more precise larges and 5 more trusty larges. Some of the nodes being decommed are xlarges too, while all of the new ones are larges.
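The pooling, depooling and draining above is gridengine administration; the exact commands are not shown in the log. On a stock gridengine install, draining and removing one node would presumably look something like the sketch below (the host name is taken from the log; the invocations are assumed from standard gridengine tooling, not confirmed by this transcript):

    # stop new jobs from being scheduled onto the node
    qmod -d '*@tools-exec-09'
    # reschedule its still-running continuous jobs onto other nodes
    qmod -rj <job-ids>
    # once the node is empty, remove it from the cluster configuration
    qconf -de tools-exec-09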
[04:39:45] !log tools delete tools-exec-10, was out of jobs [04:39:45] !log tools depooled exec-{06-10} rejigged jobs to newer nodes [04:39:46] boo morebots [04:40:16] PROBLEM - Host tools-exec-10 is DOWN: CRITICAL - Host Unreachable (10.68.17.65) [04:41:27] !log tools killed tools-dev, nobody still ssh’d in, no crontabs [04:44:52] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1248211 (10yuvipanda) Created tools-exec-121{1-9}, and just ran out of quota. [04:45:25] PROBLEM - Host tools-dev is DOWN: CRITICAL - Host Unreachable (10.68.16.8) [04:49:10] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1248219 (10yuvipanda) After a little more investigation, I think 5 more precise and 5 more trusty should hold good for a long time. Let me rejig appropriately. [04:54:19] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1248227 (10yuvipanda) Looks like this is happening again, I've poked the culprits from last time to see if it's them again. [05:01:05] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1248236 (10yuvipanda) 5Open>3stalled [05:01:12] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1100333 (10yuvipanda) 5stalled>3Open [05:01:35] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1100333 (10yuvipanda) Just realized @Dfko is part of 'culprits' :) Am making dump now. [05:02:02] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1248239 (10Dfko) We started it up again to get a key dump (see previous 3 comments) @yuvipanda [05:12:57] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1248247 (10yuvipanda) Dump provided in private :) [05:24:31] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:25:37] PROBLEM - Puppet failure on tools-exec-1213 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:25:59] PROBLEM - Puppet failure on tools-exec-1214 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:28:26] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:29:56] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:35:17] !log tools rebooted the newly created tools-exec-121{0-9} so they pick up appropriate idmapd behavior [05:36:06] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0] [05:38:24] PROBLEM - Puppet failure on tools-exec-1217 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [05:38:44] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [05:39:09] !log tools delete tools-exec-10, was out of jobs [05:39:15] Logged the message, Master [05:39:17] !log tools depooled exec-{06-10} rejigged jobs to newer nodes [05:39:21] Logged the message, Master [05:39:23] !log tools killed tools-dev, nobody still ssh’d in, no crontabs [05:39:27] Logged the message, Master [05:39:40] !log tools created new instances tools-exec-121{1-9} as precise [05:39:44] Logged the message, Master [05:39:48] PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL: CRITICAL: 16.67% of data above
the critical threshold [0.0] [05:39:55] !log tools rebooted tools-exec-121{1-9} instances so they can apply gridengine-common properly [05:39:59] Logged the message, Master [05:40:12] !log tools pooled in tools-exec-121{1-9} [05:40:16] Logged the message, Master [05:42:16] !log tools disabled and drained tools-exec-1{1-5} of continuous jobs [05:42:20] Logged the message, Master [05:43:26] RECOVERY - Puppet failure on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0] [05:44:14] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1248280 (10yuvipanda) Created tools-exec-121{1-9} and pooled them :) Also drained tools-exec-1{1-5} of continuous jobs. Things left to do: # Wait for tools-exec-xx (anything with two digits) to have no running t... [05:44:30] RECOVERY - Puppet failure on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0] [05:44:50] RECOVERY - Puppet failure on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0] [05:44:56] RECOVERY - Puppet failure on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0] [05:45:37] RECOVERY - Puppet failure on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [0.0] [05:46:03] RECOVERY - Puppet failure on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0] [05:46:05] RECOVERY - Puppet failure on tools-exec-1211 is OK: OK: Less than 1.00% above the threshold [0.0] [05:48:45] RECOVERY - Puppet failure on tools-exec-1218 is OK: OK: Less than 1.00% above the threshold [0.0] [05:49:48] 10Tool-Labs, 5Patch-For-Review: Create separate partition for /tmp on toollabs exec / web nodes - https://phabricator.wikimedia.org/T97445#1248282 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Resolved for all new nodes :D And the old ones will die soon! [05:50:11] 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Harmonize VMEM available on all exec hosts - https://phabricator.wikimedia.org/T95979#1248285 (10yuvipanda) All done for new exec nodes, and the old ones are going to go away really soon :D [05:53:26] RECOVERY - Puppet failure on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0] [05:57:08] hi all - what is service.manifest and why does webservice not work without it (I can't find any docs or any mention of this file) [05:59:52] hi [06:00:00] Earwig: I’m just fixing that particular bug, moment [06:00:07] alright, thanks [06:00:33] Earwig: basically, it’s the replacement for bigbrother, and is also the reason nobody’s been complaining (at least to me / publicly) about dead webservices. [06:00:39] no docs yet, I was hoping to do that this week [06:00:55] Hey! I need to create an instance for the Newsletter extension project ( https://phabricator.wikimedia.org/tag/mediawiki-extensions-newsletter/ ). But I assume the project is not created yet. I've created a subtask requesting the creation of the project ( https://phabricator.wikimedia.org/T97523 ). So, can I create an instance before the associated project is [06:00:55] created? [06:01:07] hooray, good to know (thought it sounded familiar, but wasn't expecting that to actually be in place yet) [06:01:25] Earwig: it’s been in place for 2 weeks now. [06:01:33] Earwig: so essentially you do ‘webservice start’ and then that’s it. it should stay up [06:01:44] right [06:02:17] Earwig: which tool doesn’t have one? I think I made most of them have it...
except if you just created a new tool [06:02:24] copyvios [06:02:28] er wait [06:02:29] actually [06:02:33] it wasn't that one [06:02:44] it was earwigbot, which just has a static index.html file [06:03:36] Earwig: hmm, I manually put most of them in place (took me a day) [06:03:53] Earwig: and when you do a webservice start manually it puts one in place anyway (when this bug is fixed, patch I am merging atm) [06:04:50] could I just stick "web: lighttpd" in it? [06:04:59] Earwig: no, just give me like, about 30s. [06:05:39] err, make that 30 more seconds [06:05:43] * YuviKTM waits for puppet to run [06:05:45] no rush [06:06:11] Earwig: try now [06:06:54] 10Tool-Labs: Fix oscillation between 'purged' and 'latest' for several packages on toollabs - https://phabricator.wikimedia.org/T97628#1248286 (10yuvipanda) 3NEW [06:07:07] works. thanks! [06:07:21] 10Tool-Labs: Fix oscillation between 'purged' and 'latest' for several packages on toollabs - https://phabricator.wikimedia.org/T97628#1248293 (10yuvipanda) [06:07:22] Earwig: cool :) [06:07:45] Earwig: it’s missing some of bigbrother’s functionality atm (that is, it only works for webservices and not grid jobs, and also no support for custom webservices yet) [06:07:48] so unannounced. [06:07:58] m'hm [06:08:04] Earwig: also this week we had to basically shift around *all* labs instances due to hardware issues, so that has taken up my time [06:08:19] Earwig: but we have enough redundancy in place that users barely noticed all the shifting! (At least from lack of any complaints) [06:08:26] so things are getting better :) [06:08:26] uhh, yeah [06:08:30] I was gonna mention that also [06:08:36] did you notice the shifting? [06:08:46] if so, in what ways? [06:08:59] redis would’ve had some connection failures because we don’t have a redundancy model for that yet... [06:09:11] you mean the new tools-exec-12* nodes and whatnot? [06:09:25] my bot died about an hour and a half ago, but I restarted it with the standard jsub command and it now seems unable to connect to IRC like usual [06:09:29] so I'm not sure what's going on with that [06:09:39] Earwig: oh, ugh. I see. do you have an error message? [06:09:54] I think I might know what’s happening (lack of public IP) [06:09:57] let me fix that [06:10:34] Earwig: is your bot running on exec-12* or exec-14*? [06:10:37] (precise or trusty) [06:10:50] unfortunately it just tells me the socket's closed and tries to restart itself to no avail [06:10:55] seems to be a 12* node [06:11:00] guess I never bothered to migrate [06:11:03] should do that soon [06:11:10] heh, I haven’t started a migrate pitch for people on grid yet [06:11:22] yeah, I sorta didn't realize that was happening in parallel to the webserver stuff [06:11:38] was a little annoying to recreate the virtualenvs and whatnot but after that it was fine [06:12:28] yeah. [06:13:07] !log tools allocating new floating IPs for the new instances, because IRC bots need them. [06:13:12] Logged the message, Master [06:13:19] I should consider having a separate queue for bots that need IRC at some point [06:13:44] Earwig: btw, if you’re running any python webservices, have you considered migrating them to the uwsgi-python server setup? [06:13:45] why would freenode require public IPs and not other services?
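For context on the service.manifest exchange above: it is a small per-tool file that the webservice watchdog (the bigbrother replacement mentioned at 06:00:33) reads to know what to keep alive. No docs existed at the time of this conversation; judging from the "web: lighttpd" question, a minimal hand-written manifest would presumably be just:

    # ~tool/service.manifest -- read by the webservice watchdog (format assumed from this exchange)
    web: lighttpd

though, as noted above, 'webservice start' writes the file itself once the patch being merged here lands, so editing it by hand should not normally be necessary.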
[06:14:00] yep, I've got one on that (which I did during the precise migration) [06:14:04] sweet [06:14:06] quite happy with that [06:14:17] Earwig: freenode limits total number of connections from one ip [06:14:18] seems to be more snappy overall [06:14:20] ah okay [06:14:30] Earwig: yeah, is going through less layers of proxying [06:17:09] Earwig: I’m allocating public IPs, your bot should be able to connect in a few minutes. [06:17:14] sounds good [06:17:28] I’ll poke you when I’m done with the precise cluster [06:20:41] Earwig: all done. check? [06:21:10] looks good! [06:21:17] Earwig: IRC connects? [06:21:21] indeed [06:21:45] Earwig: sweet :D [06:25:04] Earwig: thanks for reporting! [06:25:09] * YuviKTM does this for the trusty nodes too [06:25:13] no problem, good luck with the other stuff [06:25:15] Earwig: you should move your other tools to trusty too at some point [06:25:16] thanks! [06:25:33] will add that to the todo list [06:25:37] Earwig: if there’s something that you think tools should do better (other than the obvious, like reliability) please do let me know. [06:25:45] sure [06:26:48] 10Tool-Labs: Install grunt on tools-labs - https://phabricator.wikimedia.org/T97629#1248303 (10Mjbmr) 3NEW [06:27:24] 10Tool-Labs: Install grunt on tools-labs - https://phabricator.wikimedia.org/T97629#1248311 (10yuvipanda) 5Open>3declined a:3yuvipanda Please use npm locally to install nodejs packages. [06:30:29] !log tools added public IPs for all exec nodes so IRC tools continue to work. Removed all associated hostnames, let’s not do those [06:30:34] Logged the message, Master [06:32:49] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [06:33:11] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1248318 (10yuvipanda) I forgot to give the new instances public IPs, which was causing a bunch of failures for IRC bots. That has been remedied now with a lot of clicking. When this is all done I'm going to write... [06:36:28] 10Tool-Labs: Install grunt on tools-labs - https://phabricator.wikimedia.org/T97629#1248319 (10Mjbmr) >>! In T97629#1248311, @yuvipanda wrote: > Please use npm locally to install nodejs packages. It required root access for grunt, I tried. [07:02:50] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [07:16:05] 10Tool-Labs: Install grunt on tools-labs - https://phabricator.wikimedia.org/T97629#1248348 (10yuvipanda) If you run npm install in a directory with a valid package.json file it will install the module locally. [07:59:46] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Pavlo Chemist was created, changed by Pavlo Chemist link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Pavlo_Chemist edit summary: Created page with "{{Tools Access Request |Justification=To run a bot "PavloChemBot" in Ukrainian and maybe later in English Wikipedia. PavloChemBot already has bot flag in Ukrainian Wikipedia a..."
[08:11:26] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Pavlo Chemist was modified, changed by Pavlo Chemist link https://wikitech.wikimedia.org/w/index.php?diff=156753 edit summary: [08:12:33] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Pavlo Chemist was modified, changed by Pavlo Chemist link https://wikitech.wikimedia.org/w/index.php?diff=156754 edit summary: link to the bot [08:54:27] PROBLEM - Puppet staleness on tools-mailrelay-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [10:20:30] 6Labs, 10Labs-Infrastructure, 6operations, 10ops-eqiad: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1248596 (10hashar) @Andrew thank you for the instances migrations! [10:28:54] 10Tool-Labs, 6Engineering-Community, 6WMF-Legal: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#1248601 (10Qgil) Is someone planning to work on this task during the month of May? If so, please take it. If not, maybe it is better to lower its priority? [11:02:00] PROBLEM - Puppet failure on tools-exec-1214 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:02:38] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [11:03:25] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:03:37] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:04:15] PROBLEM - Puppet failure on tools-webgrid-08 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [11:04:27] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [11:05:18] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:05:32] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:06:04] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [11:06:36] PROBLEM - Puppet failure on tools-exec-1213 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:07:06] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:07:34] PROBLEM - Puppet failure on tools-exec-1403 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:07:44] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [11:08:02] PROBLEM - Puppet failure on tools-redis is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:08:41] PROBLEM - Puppet failure on tools-exec-1406 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:08:51] PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:09:41] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:09:45] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [11:10:31] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:11:44] PROBLEM 
- Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:11:54] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:11:56] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:11:58] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [11:12:22] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:12:30] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:12:30] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:12:46] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [11:12:54] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [11:13:50] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:13:54] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:14:10] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [11:14:40] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:14:44] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [11:14:48] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:15:02] PROBLEM - Puppet failure on tools-static-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [11:15:34] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [11:15:56] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:17:51] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:19:55] PROBLEM - Puppet failure on tools-webgrid-03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:19:59] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [11:20:55] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:21:13] PROBLEM - Puppet failure on tools-webgrid-generic-02 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [11:21:37] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [11:21:51] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:21:59] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:22:05] PROBLEM - Puppet failure on tools-services-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:22:59] PROBLEM - Puppet failure on tools-static-02 is CRITICAL: CRITICAL: 33.33% of data above the 
critical threshold [0.0] [11:23:14] PROBLEM - Puppet failure on tools-services-02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [11:24:23] PROBLEM - Puppet failure on tools-exec-1217 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [11:25:49] PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:30:19] RECOVERY - Puppet failure on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [11:30:29] RECOVERY - Puppet failure on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0] [11:33:02] RECOVERY - Puppet failure on tools-redis is OK: OK: Less than 1.00% above the threshold [0.0] [11:33:26] RECOVERY - Puppet failure on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [11:33:34] RECOVERY - Puppet failure on tools-webproxy-02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:34:15] RECOVERY - Puppet failure on tools-webgrid-08 is OK: OK: Less than 1.00% above the threshold [0.0] [11:34:25] RECOVERY - Puppet failure on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0] [11:34:43] RECOVERY - Puppet failure on tools-exec-1218 is OK: OK: Less than 1.00% above the threshold [0.0] [11:35:30] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [11:36:04] RECOVERY - Puppet failure on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [11:36:37] RECOVERY - Puppet failure on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [0.0] [11:37:00] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0] [11:37:06] RECOVERY - Puppet failure on tools-exec-1211 is OK: OK: Less than 1.00% above the threshold [0.0] [11:37:22] RECOVERY - Puppet failure on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [11:37:30] RECOVERY - Puppet failure on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [11:37:42] RECOVERY - Puppet failure on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [11:37:50] RECOVERY - Puppet failure on tools-webproxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:38:41] RECOVERY - Puppet failure on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [11:38:51] RECOVERY - Puppet failure on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [11:39:05] RECOVERY - Puppet failure on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:39:42] RECOVERY - Puppet failure on tools-exec-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [11:39:43] RECOVERY - Puppet failure on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [11:39:46] RECOVERY - Puppet failure on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [11:40:34] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:41:44] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:41:55] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0] [11:41:57] RECOVERY - Puppet failure on tools-exec-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [11:42:31] RECOVERY - Puppet failure on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [11:42:31] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% 
above the threshold [0.0] [11:42:55] RECOVERY - Puppet failure on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0] [11:43:52] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [11:43:52] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:44:50] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:44:56] RECOVERY - Puppet failure on tools-webgrid-03 is OK: OK: Less than 1.00% above the threshold [0.0] [11:44:58] RECOVERY - Puppet failure on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [11:45:04] RECOVERY - Puppet failure on tools-static-01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:45:56] RECOVERY - Puppet failure on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0] [11:46:12] RECOVERY - Puppet failure on tools-webgrid-generic-02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:47:51] RECOVERY - Puppet failure on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0] [11:49:25] RECOVERY - Puppet failure on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0] [11:50:01] @q test [11:50:01] Sorry but I don't see this user in a channel [11:50:15] @q labs-morebots [11:50:20] @unq labs-morebots [11:50:49] RECOVERY - Puppet failure on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0] [11:50:57] RECOVERY - Puppet failure on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0] [11:51:14] I trust: .*@wikimedia/.* (2trusted), .*@mediawiki/.* (2trusted), .*@wikimedia/Ryan-lane (2admin), .*@wikipedia/.* (2trusted), .*@nightshade.toolserver.org (2trusted), .*@wikimedia/Krinkle (2admin), .*@[Ww]ikimedia/.* (2trusted), .*@wikipedia/Cyberpower678 (2admin), .*@wirenat2\.strw\.leidenuniv\.nl (2trusted), .*@unaffiliated/valhallasw (2trusted), .*@mediawiki/yuvipanda (2admin), .*@wikipedia/Coren (2admin), [11:51:14] @trusted [11:51:35] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [11:51:51] RECOVERY - Puppet failure on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0] [11:52:00] RECOVERY - Puppet failure on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [11:52:04] RECOVERY - Puppet failure on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0] [11:52:04] RECOVERY - Puppet failure on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:52:38] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [11:52:58] RECOVERY - Puppet failure on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:53:16] RECOVERY - Puppet failure on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:02:53] hello [12:03:02] hi [12:03:22] though... hold on, i'll try to figure out this myself. [12:08:19] okay, i think i'm a bit lost again. i'm not sure how to ssh onto this new instance i created yesterday [12:08:52] wikitech.wikimedia.org shows me its internal IP address - i run ssh root@internal_ip and my ssh key get rejected [12:09:06] gets* [12:09:47] i created the ssh key on bastion and added it via the account preferenecs [12:09:50] preferences* [12:10:05] 10Wikibugs: wikibugs should notify on dependency changes - https://phabricator.wikimedia.org/T77006#1248774 (10Qgil) [12:10:27] i guess that this is not the way.
what should i do to tell my instance what ssh key should it let in as root? [12:48:42] 10Tool-Labs, 6Engineering-Community, 6WMF-Legal: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#1248823 (10Technical13) >>! In T87730#1248601, @Qgil wrote: > Is someone planning to work on this task during the month of May? If so, please take it. If not,... [12:54:52] d33tah: You shouldn't be using root directly. [12:55:41] 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1248833 (10akosiaris) Here's an update on this. When a labs VM wants to contact a public IP it will use its local routing table to figure out where to send the packet. The routing table has 2 entries - d... [12:57:23] Coren: any other account would be fine as well, but none of them lets me in [12:57:44] i tried d (my bastion account) and D33tah (my wikilabs acc name) [12:57:57] oh [12:58:00] d let me in now [12:58:00] You need to use your /shell/ account name. [12:58:14] okay, nvm, now it works [13:42:48] 6Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Disable LDAP and enable admin puppet module on labstore100[12] - https://phabricator.wikimedia.org/T95559#1248935 (10coren) A good summary of the issue: NFS's protocol places a hard limit on the number of supplemental groups that can be sent... [13:56:56] 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1248966 (10Andrew) Thank you for investigating, Alex! [13:58:22] Coren: Things seem to be flaky lately. Job 387832 mysteriously was dead yesterday, now this morning 387831, 387833, and 387835 are gone. In all cases it looks like the grid migrated them just before they disappeared. [14:00:31] ... And 387835 apparently returned from the dead when I tried to restart everything? Even weirder. [14:01:13] anomie: Yesterday's hardware issue (yeay!) caused us to have to do a lot of juggling. [14:12:23] andrewbogott: Hi, wb :) [14:18:10] Vivek: hello! I’m here but will vanish again shortly for breakfast. [14:18:24] Ok. [14:19:01] I have pmed you :) [14:31:52] Hi. [14:33:06] What do I have to do to get https://en.wikipedia.org/wiki/User:Technical_13/Scripts/OrphanStatus https://en.wikipedia.org/wiki/User:Zhaofeng_Li/reFill and https://en.wikipedia.org/wiki/Wikipedia:The_Wikipedia_Adventure edits added to the semi-automated section of https://tools.wmflabs.org/xtools-ec/?user=3gg5amp1e&project=en.wikipedia.org [14:36:23] EggSample: You'd need to ask one of that tool's maintainers. [14:38:22] How? The instructions on the bottom of the tool said to come here. [14:39:41] EggSample: That seems like odd instructions to put there for the general case, although cyberpower678 often is on IRC. [14:39:48] I didn't see anything for submitting a feature request there. [14:40:16] You may want to ask T13|inClass when he comes back; I know he's working on the xtools in general. [14:40:49] Oh, I think I know him! Is that the same as T13|needsCoffee? [14:41:20] I expect so. :-) [14:44:54] How do I get back to the help me channel for wikipedia I was in yesterday? [14:45:32] I'm not sure which help channel you mean. Perhaps #wikipedia-en-help? [14:45:32] #wikipedia-en-help ? [14:45:41] I think so [14:45:42] Either way, you need to type: /join [14:45:57] cool, thanks [14:45:58] But that can depend on what IRC client you are using. [14:46:06] https://webchat.freenode.net/?channels=#wikimedia-labs [15:01:08] I was pinged?
Just finished an exam and have only a minute [15:01:30] Hi T13. [15:01:39] What do I have to do to get https://en.wikipedia.org/wiki/User:Technical_13/Scripts/OrphanStatus https://en.wikipedia.org/wiki/User:Zhaofeng_Li/reFill and https://en.wikipedia.org/wiki/Wikipedia:The_Wikipedia_Adventure edits added to the semi-automated section of https://tools.wmflabs.org/xtools-ec/?user=3gg5amp1e&project=en.wikipedia.org [15:01:52] I didn't see anything for submitting a feature request there. [15:02:49] Click on the link for "bug" and submit it there or go to #xTools and type !newbug [15:02:59] Okay, thanks. [15:05:56] 10Tool-Labs-xTools: Add new semi-automated tools - https://phabricator.wikimedia.org/T97647#1249100 (103gg5amp1e) 3NEW [15:06:51] Good good. Break is over. [15:26:02] 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1249149 (10hashar) Well done @akosiaris, you have been granted a coupon for your favorite drink. To be redeemed next time we see each others, just point to this task. ---- Regarding the use of a bridge, u... [15:38:44] Coren: note that in ^ Alex determined that fixing routing is probably not realistic, so we’re back to needing split horizon. [15:39:35] andrewbogott: Yeah, I read. And I've been looking at the pdns docs hoping that split horizon would be trivial as it is with bind... nope. But I do have an idea. [15:39:46] You're backing pdns with mysql, right? [15:39:59] Coren: yep, that’s right. [15:40:46] I'm thinking the solution may be simpler than trying to convince pdns to do split-horizon: have two pdns run listening to different addresses and getting their data from different views on the same table. :-) [15:42:18] Coren: That seems possible, although… yikes! [15:42:38] Coren: I don’t know if this is useful… that server already recurses selectively for internal IPs [15:42:50] So that’s the /switch/ we need, although not the behavior. [15:43:12] If there was a second dns server… how would clients know when to query one and not the other? [15:43:54] andrewbogott: The clients will query whatever we point them to; that just means that resolv.conf in labs instances should point at the "provides internal IPs" iface [15:44:18] Ah, I see. Hm. [15:44:43] Whereas anything coming from the outside will just hit the NS records that point to the "provides public IPs" iface [15:46:20] This is all predicated on you being able to tell designate to add an extra column to entries it writes. We write the public IP in one (when it exists) and the private IP in the other; just have each pdns server query a view picking the right one. Only one source of authority, and no chance of divergence. [15:49:24] As far as ugly hacks go, this is a fairly tame and orderly one. :-) [15:50:44] Except pdns /does/ support split horizon, doesn’t it? [15:51:00] And, doesn’t this mean that every single public entry will have to be hand-tuned and updated as instances move &c? [15:51:26] Yeah adding local hacks seems unideal [15:51:33] (Hi) [15:51:41] andrewbogott: Not according to everything I've read - though I've seen suggestions that geoip can be abused for that [15:51:51] But no - why would that need any sort of manual handling whatever? [15:52:13] Coren: ah, ok, I think I see what you mean… [15:52:29] seems possible. Where would the second pdns run?
[15:53:11] andrewbogott: The simplest solution would be to have an address in 10/8 for where they run now and have one listen on that iface, the other on the public IP [15:54:15] Ah, and I'm not the first one to think of this. There's a thread on pdns-users where: "Another approach would be to run two instances of pdns. Every instance would run on a specific ip which corresponds [15:54:15] to the subnet that you want to use." [15:55:01] Otherwise, split horizon support is described as "'maybe': you could use its Lua feature to fiddle with returning different values on a per/client basis." [15:55:13] Which seems even worse to me. [15:55:43] Would it really work to have two instances of pdns running on the same server? [15:55:59] andrewbogott: Of course it would, if they listen on different addresses. [15:56:30] init scripts, /etc/pdns... [15:56:39] I guess a second init script that points to a different config? [15:56:49] Seems like the simplest approach. [15:57:01] 'k [15:57:03] You just need to have a 'local-address=XX' [15:57:16] Where XX is either the public IP for one, or the private one for the other. [15:57:22] yeah, that makes sense. [15:57:24] oh, time for filippo’s talk [15:57:43] Bleh, that's today? My lunch just got here. :-( [16:17:26] 10Tool-Labs, 6Engineering-Community, 6WMF-Legal: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#1249301 (10yuvipanda) I don't know if just closing the rfc on meta is enough - this needs some consensus from the tech community / toollabs admins / wikimedia... [17:13:35] YuviKTM: Just FYI, when I do the switchover, I'm going to be doing the start-nfs steps by hand, not using the script, so that I have an opportunity to check at every step that everything is going to plan. [17:13:59] After all, I didn't try a switchover in 2 years or so. [17:15:00] Coren: cool. Just make sure that any missing steps get added to the script after, and !log a lot :) [17:15:18] Coren: what time is it in? I want to make sure I'm at a laptop then [17:15:39] 19h UTC [17:16:09] aka in 105 minutes [17:16:25] Ah cook [17:16:27] Cool [17:16:34] I'll be in office by then [17:45:59] * Core- wonders why cameron just kicked out his bouncer. [18:57:54] hasharAway: ping me if you’re back in the next hour or so? [18:58:02] or, let’s see… Krinkle, you there? [18:58:12] Yep [18:58:15] What's up [18:58:28] Can I shut down integration-puppetmaster and/or integration-slave-trusty-1013 briefly? [18:59:29] I can gracefully depool the slave in a few minutes [18:59:47] Krinkle: great, thanks. [18:59:57] The puppetmaster, I’m guessing no one will notice, it’ll just delay puppet runs slighty? [19:00:01] *slightly [19:00:07] Yeah, but I'd prefer to down the slave [19:00:14] ok [19:00:27] OK. It's depooled now [19:00:35] I need to move both eventually, just trying to get a head start on my pre-announced downtime next week. [19:00:47] ok, I’ll start with the slave in a moment. [19:03:31] hi [19:04:33] 23:00:55* wikibugs $ Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247078 (yuvipanda) Open>Resolved a:yuvipanda I've created the project and added you as admin :) Remember that the /data/project and /home mounts on instances are NFS, and should *not* be used for heavy lifting. [19:05:15] i noticed i have /dev/vda3 that has all my quota, but it doesn't seem to have any logical volumes created... can i just format it as a non-lvm partition?
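A sketch of the two-pdns-instances plan from the 15:50-15:57 exchange above: run the authoritative pdns server twice on the same box, each bound to a different address via local-address and each reading a different database view of the same records. local-address, launch and the gmysql-* settings are stock pdns options; the file names, addresses and view names here are illustrative assumptions, not the configuration actually deployed:

    # /etc/powerdns/pdns-internal.conf -- answers labs instances with private IPs
    local-address=10.68.16.1            # private iface (example address)
    launch=gmysql
    gmysql-dbname=pdns_internal_view    # view exposing the private-IP column

    # /etc/powerdns/pdns-public.conf -- answers the outside world with public IPs
    local-address=203.0.113.1           # public iface (example address)
    launch=gmysql
    gmysql-dbname=pdns_public_view      # view exposing the public-IP column

resolv.conf in the instances then points at the internal address, while the public NS records point at the public one, as described in the conversation above.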
[19:05:42] d33tah: you should just use lvm to allocate space [19:06:00] d33tah: there are puppet classes that will do that automatically. labs::lvm::somethingsomething [19:06:14] d33tah: in special:novainstance under configure you can find a check box named labs::lvm::srv or something [19:06:25] And you can tick that to get a big /srv partition [19:06:27] okay, i'll check that [19:06:30] um… role::labs::lvm::srv [19:06:41] * andrewbogott looked it up! [19:07:15] We should document that somewhere... [19:07:35] andrewbogott: most of the old exec nodes should be gone today. New ones already pooled up [19:07:45] I'll do webgrid right after [19:07:47] YuviKTM: great! [19:08:19] Every other thing I try to do is blocked on upgraded to Juno so I’m itching to get this migration finished [19:08:25] andrewbogott: I repooled 29 new nodes yesterday. Current ones just have tasks still executing, I'm killing them one by one [19:08:28] Hehe [19:08:32] *upgrading [19:09:15] should the changes apply automatically? [19:09:37] YuviKTM: Move aborted. I paranoidly decided to start by a clean reboot of labstore1002. Wisely. It doesn't pass POST anymore. [19:09:51] Coren: :) :( [19:09:57] d33tah: On the next puppet run. That happens every 20 mins or you can force one with ‘sudo puppet agent -tv’ [19:10:11] i see, thanks [19:10:13] It is not really paranoia if they are out to get you, Coren [19:11:18] YuviKTM: Thank god I started early to get a reboot in. It would have been a timebomb. [19:11:30] +1 [19:12:16] yeah! Lucky break. [19:12:20] Coren: so I guess we wait for someone to show up at the DC and see what is up [19:14:24] That said, that is *not* what I did a reboot for or expected. I wanted to make sure the system was in perfect puppet state and ready to take over - not test the effing hardware. [19:14:47] * Coren is now really really scared about 1001 [19:15:30] YuviKTM: Yeah, I don't think I'll gain much by continuing to glower at the blank console. [19:16:42] * Coren powers the server down. [19:21:44] 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4: Switchover Labs NFS server to labstore1002 - https://phabricator.wikimedia.org/T97219#1250136 (10coren) [19:22:23] 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4: Switchover Labs NFS server to labstore1002 - https://phabricator.wikimedia.org/T97219#1235954 (10coren) This was planned for today at 19H UTC but is delayed because labstore1002 seems to be having hardware issues. What //is// it with Labs hardware? [19:23:04] This also means that our safety net seems to have been untied and flat on the floor all along. [19:23:12] Krinkle: integration-slave-trusty-1013 is now back up. May I move the puppetmaster now? [19:24:29] andrewbogott: It seems Jenkins is unable to reach that instance now [19:24:32] did the IP change? [19:24:35] YuviKTM: can quarry-runner-test be deleted? Or, if not, is it ok if I cold-migrate it? [19:24:53] Krinkle: yes, probably. The current IP is 10.68.18.28 [19:24:55] andrewbogott: can’t be deleted, but let me see if it can be cold migrated. [19:25:00] it used to be 10.68.18.28 [19:25:14] Nope, still the same then [19:25:20] andrewbogott: can you give it about 5 minutes, and then cold migrate both quarry-runner-* instances? one query is actively running [19:25:27] they have a timeout of 10mins so should finish fast [19:25:37] YuviKTM: sure [19:25:47] See https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1013/configure. Tests can be done from https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1013/script e.g.
println "uname -a".execute().text [19:25:50] any wmf ldap user can do so [19:26:06] Okay, it has a connection now [19:26:18] Ah, probably just took time to boot then [19:26:54] Yeah, go ahead with the puppetmaster [19:27:01] ok, thanks. [19:28:06] !log integration moved integration-puppetmaster and -slave-trusty-1013 to labvirt hardware. This involved a reboot and possible IP change. [19:28:12] Logged the message, dummy [19:30:27] !log tools depooled and deleted tools-exec-01, -05, -06 and -11. [19:30:33] Do we have a bot that checks for old iw links that should be removed? Like on https://da.wikipedia.org/wiki/Analyse ? [19:30:34] boo no morebots [19:30:59] Krinkle: ok, puppetmaster move finished, thank you. [19:31:10] YuviKTM: ready for me to move quarry instances? [19:31:15] Logged the message, Master [19:31:51] PROBLEM - Host tools-exec-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.30) [19:31:53] andrewbogott: still running. give it another 3 minutes? [19:31:56] is at 7mins [19:31:56] andrewbogott: I note some instances are being quite sluggish. Your doing? [19:32:15] tools-bastion is doing that thing again where I'm waiting a minute just for something like ls or nano to complete. [19:32:15] Coren: maybe? It would just be network saturation if it’s me. [19:32:19] * YuviKTM is noticing that too [19:32:32] PROBLEM - Host tools-exec-05 is DOWN: CRITICAL - Host Unreachable (10.68.16.34) [19:32:35] PROBLEM - Host tools-exec-11 is DOWN: CRITICAL - Host Unreachable (10.68.17.144) [19:32:47] Woah [19:32:55] the hosts are me [19:33:02] (I logged earlier, no morebots) [19:33:04] labs-morebots: ? [19:33:04] I am a logbot running on tools-exec-1215. [19:33:05] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [19:33:05] To log a message, type !log . [19:33:10] !log tools depooled and deleted tools-exec-01, -05, -06 and -11. [19:33:13] oh it did log [19:33:14] Logged the message, Master [19:33:15] I’m just blind [19:33:19] cpu usuage on the hosts is pretty reasonable... [19:33:48] Labstore is healthy and showing nothing huge; but there is a sudden drop in labnet1001 traffic that lasted a bit. [19:34:02] Did we just briefly lose the network? [19:34:06] Krinkle: you’re talking about bastion-01? Is it snappy now? [19:34:21] PROBLEM - Host tools-exec-06 is DOWN: CRITICAL - Host Unreachable (10.68.16.35) [19:34:31] andrewbogott: It comes and goes, it's snappy most of the time and then once or twice an hour it just stalls flat line for a minute. [19:34:32] andrewbogott: It works well for me now. [19:35:02] Seems to affect tools.wmflabs.org response as well during that same time frame, e.g. when editing php files [19:35:05] so probably NFS? [19:35:06] Krinkle: that’s probably me hogging the network during migrations. Lemme see if I can throttle the copy. [19:35:18] I’m on a slow enough connection now that I don’t notice the difference. [19:35:52] Krinkle: Nor as far as I can tell - there isn't so much as a blip visible on the server. [19:36:15] * Coren has graphs open permanently on one of his screens nowadays. [19:37:13] andrewbogott: go ahead and move them now [19:38:40] So, let’s see… network is 1gb which is 128MB which is… 131,000KB, so I should rsync --bwlimit 50000 <- Coren, check my math? [19:40:24] Krinkle: let me know if it happens again in the next few minutes. [19:40:30] Coren: I’m going to keep killing stewardbot stuff that’s running on tools-login. Can you cluebat the maintainers? 
[19:41:08] !log quarry cold-migrating quarry-runner-test to labvirt1003 [19:41:12] Logged the message, dummy [19:42:08] andrewbogott: Sounds about right [19:42:22] > # Temporary wrapper script for StewardBot by Snowolf [19:42:25] not so temporary [19:42:44] YuviKTM: Please don't. I'll do the cluebat thing but those scripts are pretty critical for the stewards. [19:42:56] Coren: they’ve been running for days. I don’t think that’s acceptable. [19:42:57] Eff. They promised they'd have someone move them to the grid weeks ago. [19:43:34] YuviKTM: They'll probably need dev help to fix their things, the current crop of stewards has almost no techies in it. [19:44:00] YuviKTM: In the meantime, we can't hobble them if we can avoid it at all. [19:44:23] Coren: I killed the script and submitted it to the grid [19:44:32] and it seems to be running fine. [19:44:59] Coren: we shouldn’t special case, I think [19:45:04] I left a note on that script. [19:45:30] Coren: I told them before as well. It's because it was migrated from toolserver, where it was okay to run on login servers. [19:45:41] I pinged the maintainers a minute ago. [19:45:44] thanks Krinkle [19:45:52] Krinkle: it seems to be running ok on the grid [19:46:15] did you move it? [19:46:19] Krinkle: yeah. [19:47:00] YuviKTM: Is there a script that starts it on the bastion? If you're doing it for them may wanna make sure whatever command they remembered to run to start it does the right thing [19:48:23] Krinkle: so I did that too. [19:49:44] Krinkle: where did you ping the maintainers? (which channel, etc?) [19:49:51] labs [19:49:52] eh [19:49:54] stewards [19:50:04] Coren just did as well [20:01:31] Yeah, I poked a bit more at labstore1002 in the hope that I could have some inkling of what's up with the damn thing but there is only so much I can do via a serial console from 1000 km away. [20:03:07] * Coren almost considered the 10h drive. [20:03:40] Coren: https://www.youtube.com/watch?v=dLk-3HPS12Q style? [20:06:05] Heh. That's clearly a spoof of some other scene. :-) [20:06:53] yeah, or maybe just ‘fuck the p…rinter!' [20:11:06] 10Tool-Labs-tools-Global-user-contributions: GUC: Russian wikis have broken url (http:/// instead of http://) - https://phabricator.wikimedia.org/T94351#1250264 (10Krinkle) [20:11:13] andrewbogott: I am around :] [20:11:32] hashar: ok — I was just messing with integration instances. I think I’m done now. [20:11:47] andrewbogott: awesome! [20:12:08] hashar: for now, at least :) [20:12:25] andrewbogott: chasemp: I guess we should get rid of our weekly checkin tomorrow since May 1st is a holiday :] [20:12:37] hashar: yeah, I figured we’d skip it. [20:12:56] it is no more in my agenda [20:12:59] so I guess we deleted it [20:13:07] hashar: are you blocked by anything other than the obvious? [20:13:09] 10Tool-Labs-tools-Global-user-contributions: GUC: Russian wikis have broken url ... T94351#1250289 (10Krinkle) ``` MariaDB [meta_p]> SELECT * FROM wiki WHERE lang LIKE '%ru%'; +-------------------+------+--------------------------+-------------+------... [20:13:18] Oh, did you and Krinkle still need a new instance flavor? [20:13:23] 10Tool-Labs-tools-Global-user-contributions: GUC: Russian wikis have broken url ... T94351#1250291 (10Krinkle) p:5Triage>3High a:3Krinkle [20:13:27] yeah krinkle was requesting a different disk size [20:13:36] got a link?
[20:14:35] https://phabricator.wikimedia.org/T96706#1230309 [20:14:44] Can you increase ci1.medium to 40GB storage? (Like m1.medium) [20:15:10] hashar: does anything currently use the old flavor? [20:15:14] I wish we could finely tune how much cpu/mem/disk we need when we create an instance [20:15:19] I can create a new one with a new name or delete the older one... [20:15:24] !log mailman allocated 1 more public_ip to project (per JohnFLewis) [20:15:27] but I don’t know what will happen if I delete it when something is using it [20:15:27] andrewbogott: Yeah, one instance, though I don't mind what happens to that one, it's not used. [20:15:29] Logged the message, Master [20:15:35] !log mailman set MX record for mailman-three.wmflabs.org to itself [20:15:38] Krinkle: can you delete it? [20:15:39] Logged the message, Master [20:15:42] We misjudged how the size would be distributed between root and /srv/. Hence need 10GB extra [20:15:46] hashar: Can you delete it? [20:15:50] ahah time to be bold and delete the flavor and see what happens to the instance ? :D [20:16:06] Krinkle: yup [20:16:10] thanks [20:16:16] let me know when you’re clear... [20:18:27] andrewbogott: I have deleted the ci1.medium instance [20:18:28] hashar: deleted? [20:18:34] great, let me resize... [20:19:57] Krinkle or hashar, try now? [20:20:34] /usr/local/bin/vagrant-lxc-wrapper: No such file or directory [20:20:35] pfff [20:20:46] andrewbogott: I just co-ordinated with JohnFLewis about shrinking the tools-exec-wmt. a few mins downtime is fine apparently. [20:20:51] breaking the space-time continuum. [20:20:58] reality is shattering [20:20:59] :D [20:21:17] hashar: are you playing with my vagrant lxc hacks? [20:21:22] YuviKTM: ok, should I shrink it right now? [20:21:32] andrewbogott: yup! [20:21:38] ok... [20:21:46] JohnFLewis: can you verify it is fully back up once it comes back? [20:21:51] hashar: That file comes from the vagrant-lxc plugin [20:21:54] bd808: na ranting at Debian [20:21:57] YuviKTM: Sure [20:22:20] bd808: I somehow had the crazy idea of investigating / trying Vagrant [20:22:33] bd808: and thought that using LXC would be nicer than virtualbox :] [20:23:15] hashar: :) I have tried it out in labs (mediawiki-vagrant + lxc) [20:23:22] it seems to work pretty well [20:23:51] I'm not sure I love the ruby script they use in the plugin to proxy all the sudo commands [20:24:01] YuviKTM: JohnFLewis: Done, shrunk 27G. Still working ok? [20:24:22] 6Labs, 10Continuous-Integration-Infrastructure: Create an instance image like m1.small with 2 CPUs and 30GB space - https://phabricator.wikimedia.org/T96706#1250343 (10hashar) 5Open>3Resolved Resized by @andrew integration-slave-trusty-1021 https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000be1.eqia... [20:24:25] andrewbogott: all good thank you [20:24:27] (updated https://etherpad.wikimedia.org/p/tools-the-great-recompress) [20:24:29] 6Labs, 10Continuous-Integration-Infrastructure: Create an instance image like m1.small with 2 CPUs and 30GB space - https://phabricator.wikimedia.org/T96706#1250346 (10Andrew) ok, I deleted and recreated with 40G. Are the other stats still correct? [20:24:36] hashar: ok! [20:24:49] bd808: I have to figure out how to set it up on my Debian [20:24:56] andrewbogott: looks like it is [20:25:08] JohnFLewis: great, thank you. [20:25:08] * hashar grab what is left of the bottle of whiskey [20:25:21] YuviKTM: I have a report that lists all instances in need of shrinkage. Let me run... 
[20:25:26] Should be mostly empty at this point [20:25:42] andrewbogott: a few tools instances left. bastions, mail and cyberbot [20:27:14] 10Tool-Labs-tools-Global-user-contributions: Global user contributions: Support wildcard in username - https://phabricator.wikimedia.org/T66499#1250352 (10Krinkle) p:5Triage>3Normal a:3Krinkle [20:29:33] Labs was clearly built on a native burial ground. [20:32:45] * hashar shakes fist at sudo [20:34:06] YuviKTM: the first quarry instance is still copying. It’s taking forever because xlarge [20:37:56] YuviKTM: remaining shrinkage candidates are here: https://dpaste.de/ZT02 Note that my script can’t tell the difference between something that needs shrinking and something that is full up. [20:38:15] so, for instance, those deployment-prep instances can’t actually shrink any more. [20:53:27] !log quarry moving quarry-runner-01 to labvirt1004 [20:53:31] Logged the message, dummy [21:45:25] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Pavlo Chemist was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=156843 edit summary: [21:54:57] andrewbogott: did the quarry stuff all move? [21:55:39] YuviKTM: the runners but not ‘main'. [21:55:45] I can move main now if that’s ok [21:55:49] andrewbogott: you can do that too, yeah. [21:57:56] !log quarry moving quarry-main-01 to labvirt1003 [21:58:00] Logged the message, dummy [21:59:51] * YuviKTM moves floors again [21:59:53] so much moving [22:05:51] YuviKTM: ok, that’s it for quarry. [22:08:37] andrewbogott: sweet. [23:17:21] sitic: done on both requests, btw (phab and gerrit) [23:17:31] sitic: you can push your current git repo to the gerrit repo - it allows direct push for first commit [23:17:37] let me know if you want me to do that for you [23:47:40] 10Tool-Labs, 6Engineering-Community, 6WMF-Legal: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#1251068 (10Technical13) >>! In T87730#1249301, @yuvipanda wrote: > I don't know if just closing the rfc on meta is enough - this needs some > consensus from th...
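For the direct first push mentioned at 23:17:31, assuming the usual gerrit.wikimedia.org remote layout (user and project names are placeholders), the commands would be roughly:

    git remote add gerrit ssh://<user>@gerrit.wikimedia.org:29418/<project>.git
    git push gerrit HEAD:refs/heads/master

After that first push, subsequent changes normally go through review by pushing to refs/for/master instead.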