[00:12:02] Labs, Labs-Infrastructure, Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1950597 (chasemp) In looking at poor client behavior it seems prudent to poke at the server. A few weeks ago I added some instrumentation to get statistics from labstore1001 (a few general... [00:17:07] tgr: ok [00:19:59] Labs, Labs-Infrastructure, Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1950641 (chasemp) A few notes on `tools-webgrid-lighttpd-1201.tools.eqiad.wmflabs` > top - 00:17:20 up 21 days, 21:02, 0 users, load average: 168.20, 168.12, 167.9 ---- {P2503} Sam... [00:20:03] tgr: tried on test.wikidata and i'm apparently missing the token param [00:20:04] https://phabricator.wikimedia.org/P2505 [00:20:31] thanks! [00:20:37] is that pywikibot? [00:20:52] my own scripts (php) [00:21:03] is it available somewhere? [00:21:08] suppose i might have to update them [00:21:21] Labs, Labs-Infrastructure, Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1950651 (chasemp) tools-webgrid-lighttpd-1201 did not respond to soft reboot via salt and I rebooted it the hard way [00:23:57] Labs, Tool-Labs: tools-webgrid-lighttpd-1201 webservices and ssh unaccessible - https://phabricator.wikimedia.org/T122719#1950667 (chasemp) Open>Resolved https://phabricator.wikimedia.org/T124133#1950641 https://phabricator.wikimedia.org/T124133#1950651 [00:24:00] Labs, Labs-Infrastructure, Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1950669 (chasemp) [00:25:12] tgr: the code's is still WIP but here: https://github.com/filbertkm/wikibot [00:25:28] and never had a problem with the login parts [00:31:12] aude: thanks! the related bug is https://phabricator.wikimedia.org/T124252 [00:32:26] ok [02:41:54] Krenair: can I delete labs-dnsrecursor2.openstack? Or is it still doing things?
[02:42:22] you can delete it [02:43:46] thanks [02:48:51] andrewbogott: are you clearing out things with broken puppet? [02:49:19] just randomly picking things that complain about apt-get update while the kernel update script runs [02:49:26] nothing comprehensive [02:51:21] nice [02:55:52] bd808: what about the logstash project? puppet’s broken there too [02:56:10] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed when searching for node puppetmaster.logstash.eqiad.wmflabs: Failed to find puppetmaster.logstash.eqiad.wmflabs via exec: Execution of '/usr/local/bin/ldap-yaml-enc.py puppetmaster.logstash.eqiad.wmflabs' returned 1: [03:05:47] aude: can you tell me how to set up wikibot? [03:06:03] I reproduced what it seems to do in curl and that gets me logged in fine [03:06:34] there must be some subtle difference but I can't find it by reading the code [03:06:57] tgr: i can now login and edit my own mediawiki instance (master) [03:07:27] but still can't on test.wikidata? [03:07:44] andrewbogott: you can kill the current instances in the logstash project, but leave the project itself please. [03:07:47] let me try again [03:07:52] bd808: ok, thanks [03:08:02] Thank you [03:08:07] i had to fix https://gerrit.wikimedia.org/r/#/c/265449/ but think it is unrelated [03:08:28] not sure if i need to use bot passwords now or not [03:09:11] you can but don't need to for at least a month [03:21:30] YuviPanda: https://gerrit.wikimedia.org/r/#/c/265451/ [03:21:37] tgr: i added some documentation and sample configs [03:21:44] but can try again myself [03:21:50] I think that 'apt-get update’ has been timing out on pretty much all trusty instances for ages [03:22:56] andrewbogott: uh, but I did an rm with salt [03:23:11] ah, ok, that explains why it’s present some places but not all :) [03:23:15] Any objection to the patch? 
[03:23:17] :) [03:23:34] andrewbogott: I still think we should just do an rm with clush instead [03:23:37] but no objections as such ;) [03:23:42] we can remove it in a few weeks [03:23:51] ok, I’ll do it with clush [03:25:17] tgr: doesn't help that i can't reproduce on my devwiki [03:26:46] tgr: you can ply with pywikibot at https://tools.wmflabs.org/paws [03:26:50] *play [03:27:51] Don't try to use your (WMF) account though. I'm pretty sure it still has issues with usernames that have non-alphanumeric chars [03:30:04] tgr: i get something on test.wikidata about my account not existing [03:30:14] and not on my devwiki [03:31:31] (account does exist on test.wikidata and i have the right password for it) [03:35:29] that sounds like the bug anomie patched today with the id being looked up locally rather than from centralauth [03:35:44] could be [03:35:59] but test.wikidata should have that patch [03:36:25] YuviPanda: lol. I just saw the namespace id for Hiera: [03:36:59] bd808: what did I pick? [03:37:03] 666 [03:37:05] I remember I did something I thought was funny [03:37:07] right [03:37:17] It's a shame since I wanted it for Notebook: [03:37:22] and was looking for it [03:37:33] bd808: that was an oauth patch, wikibot uses the login API [03:37:34] and then 'some dickhead had taken it already!' [03:37:44] 668: neighbor of the beast [03:37:47] :D [03:37:54] 667: Talk page of teh Beast [03:38:08] tgr: pwb with PAWS uses OAuth [03:38:11] not bot passwords tho [03:38:13] but full on OAuth [03:38:16] Do we use the Project: namespace on wikitech for anything? 
[03:38:48] https://gist.github.com/filbertkm/c1e14a3fbda62fd5db88 [03:39:12] ^ these are the steps taken to login and the responses (sans my tokens ,etc) [03:40:28] YuviPanda: if https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=13&fullscreen can be believed there were 400 login errors per second [03:40:43] ouch [03:40:48] no way there are enough clients using OAuth or bot passwords [03:40:54] ...for that [03:41:31] PWB with OAuth wouldn't call the login API anyway [03:41:45] botpasswords doesn't even yet work for wikidata (until i put my patch in swat) [03:41:51] not for wikibase-specific things [03:42:48] aude: thanks but the trick will be somewere in the cookie handling [03:42:55] ok [03:43:06] I can login rto test.wikidata by doing those steps in curl [03:43:28] and sending back the session cookie in the second step [03:43:43] the bot must do some small detail differently [03:43:54] ok [03:44:01] i might be missing that [03:44:03] or maybe the account you use is different in some relevant way, I have no idea [03:44:18] though not necessary on my devwiki [03:44:31] i'm using my AudeBot account on testwikidata [03:55:46] aude: what command were you using to test the login on testwikidata? [03:56:56] tgr: i tried testwikidata with my non-bot account and that works [03:57:06] i'm trying to set a label [03:57:13] (let me try again with my bot account) [03:58:41] and now AudeBot works :/ [03:58:50] i don't know how [03:59:03] I tried 'app/console set-label testwikidatawiki Q100 Test 1' and that gave me a serialization error [04:00:16] nevermind, it works if I give it a base revision that actually exists [04:00:17] ./app/console set-label testwikidatawiki Q583 wzgAhEff 25580 [04:00:27] obviously i need to make the baserev part automatic :) [04:00:44] (normally have been using this stuff to debug and test things, or one off tasks) [04:01:02] probably revision 1 is the main page (e.g. 
wikitext) [04:06:06] in any case, thanks for looking into it [04:06:15] sure [04:06:22] it's strange that i can login now [04:06:43] i did end up logged in as my bot on wikidata (on the site) [04:07:02] and did a bot edit on testwikidata as Aude [04:07:36] something must have caused it to work [04:15:30] !log tools.stashbot restarted to fix https://github.com/bd808/tools-stashbot/issues/1 [04:19:26] tgr: you were saying that pywikibot already uses oauth for bots? [04:19:35] or has already adapted to the changes? [04:20:28] aude: if the bot operator has set it up, yes [04:20:38] https://www.mediawiki.org/wiki/Manual:Pywikibot/OAuth [04:20:41] ok [04:21:00] i just want to make sure the wikidata communtiy gets informed (though bot authors should already be on wikitech, etc) [04:24:37] aude: https://lists.wikimedia.org/pipermail/wikitech-l/2016-January/084501.html has the details but short story is ideally your bot should do oauth, if it can't you can go to Special:BotPasswords and set up a special username/password for which no changes in bot code are needed [04:25:16] tgr: i saw the mail, yes [04:25:35] i have some code for oauth (or could use one of the libraries for that) [04:26:09] adding to the wikidata newsletter https://www.wikidata.org/wiki/Wikidata:Status_updates/Next#Other_Noteworthy_Stuff [04:26:23] to help make sure people see this [04:28:07] legoktm: can you poke morebots to get it back in this channel? [04:28:22] * aude needs sleep now :) [04:28:29] uh, lets see [04:28:42] hope you figure out the login problem soon [04:29:56] tools.morebots@tools-bastion-01:~$ qmod -rj labs-logbot [04:29:56] Pushed rescheduling of job 228481 on host tools-exec-1219.eqiad.wmflabs [04:30:02] !log tools.morebots restarted labs-logbot [04:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.morebots/SAL, Master [04:30:09] bd808: ^ [04:30:16] yay! 
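The two-step login debugged earlier in this log (request a token, then repeat the request with the token while sending the session cookie back) can be sketched without any network access. This is a minimal sketch, assuming the legacy `action=login` behaviour in which the first request answers `NeedToken`; `FakeWiki`, `login` and the `session` cookie name are hypothetical stand-ins, not the code of wikibot or MediaWiki.

```python
"""Two-step login sketch against a fake in-memory server (no network access)."""
import secrets


class FakeWiki:
    """Hypothetical stand-in for a wiki API that binds a login token to a session cookie."""

    def __init__(self, users):
        self.users = users      # {name: password}
        self.sessions = {}      # {session_id: token}

    def post(self, params, cookies):
        """Return (reply, cookies_to_set) for one login request."""
        session = cookies.get("session")
        token = params.get("lgtoken")
        if token is None or self.sessions.get(session) != token:
            # Step 1, or the client lost its cookie: open a session, issue a token.
            session = secrets.token_hex(8)
            self.sessions[session] = secrets.token_hex(8)
            reply = {"login": {"result": "NeedToken", "token": self.sessions[session]}}
            return reply, {"session": session}
        # Step 2: the token matches the session, so check the credentials.
        ok = self.users.get(params["lgname"]) == params["lgpassword"]
        return {"login": {"result": "Success" if ok else "WrongPass"}}, {}


def login(wiki, name, password, keep_cookies=True):
    """Run the two-step flow; keep_cookies=False imitates a client that drops cookies."""
    cookies = {}
    params = {"action": "login", "lgname": name, "lgpassword": password}
    reply, set_cookies = wiki.post(params, cookies)
    if keep_cookies:
        cookies.update(set_cookies)
    if reply["login"]["result"] == "NeedToken":
        params["lgtoken"] = reply["login"]["token"]
        reply, _ = wiki.post(params, cookies)
    return reply["login"]["result"]
```

With `keep_cookies=False` the second request is treated as a fresh first request and gets `NeedToken` again, which is the kind of cookie-handling difference suspected in the discussion.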
[04:30:25] thanks legoktm [04:30:28] np [04:31:03] oops [04:32:21] uhh [04:33:02] It did log the message before dying so probably not another session bug [04:33:59] bd808: https://tools.wmflabs.org/?tool=morebots [04:34:05] enjoy! :) [04:34:22] oh no! I asked too many questions! [04:39:12] hmmm... "Died in main event loop" [04:39:29] from a KeyboardInterrupt? [04:39:42] I think that means it used too much memory? [04:41:55] !log tools.morebots Restarted labs-morebots with ./labs.sh [04:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.morebots/SAL, Master [04:43:49] well it didn't die again yet which is nice [04:44:12] !log tools.bd808-test Testing labs-morebots a bit [04:44:42] labs-morebots: hello? [04:44:42] I am a logbot running on tools-exec-1207. [04:44:43] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [04:44:43] To log a message, type !log . [04:45:49] !log tools.bd808-test Testing labs-morebots take 2 [04:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bd808-test/SAL, Master [05:04:53] aude: did you maybe try to manually log in on testwikidata with your bot account and fail? [05:07:17] also, do you remember how many times you got the notoken error? [05:07:25] * tgr is trying to make sense of the logs [06:40:08] Labs, wikitech.wikimedia.org: "action=formedit" doesn't work any more - https://phabricator.wikimedia.org/T124248#1951237 (Luke081515) Works for me...
[07:24:00] Labs, wikitech.wikimedia.org: "action=formedit" doesn't work any more - https://phabricator.wikimedia.org/T124248#1951247 (Florian) [07:24:02] Labs, Labs-Infrastructure, Tool-Labs, MediaWiki-extensions-SemanticForms, Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1951248 (Florian) [07:24:59] Labs, wikitech.wikimedia.org: "action=formedit" doesn't work any more - https://phabricator.wikimedia.org/T124248#1950333 (Florian) The most important comment of the duplicate task is [[ https://phabricator.wikimedia.org/T123583#1950540 | this one, posted by Reedy ]] :) [08:46:14] Labs, Phragile, TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1951317 (WMDE-leszek) Open>Resolved a:WMDE-leszek Thank you all for trying to help us getting access to the instance again. As we needed to update software running on the instance a... [08:46:28] Labs, Phragile, TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1951320 (WMDE-leszek) Resolved>declined [09:31:44] Labs, Tool-Labs: Replication lag starting a script on tools.taxonbot - https://phabricator.wikimedia.org/T124172#1951359 (jcrespo) Open>Resolved a:jcrespo All lag finally went away at 3:25 UTC. [09:35:21] Labs, Labs-Infrastructure, Beta-Cluster-Infrastructure, operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1951375 (hashar) [09:43:50] Labs, Labs-Infrastructure, Beta-Cluster-Infrastructure, operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1951388 (faidon) >>! In T50501#527689, @Krinkle wrote: > Would it be an option to flatten our subdomains? > > We'd only need b...
[14:52:51] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#1951866 (Aklapper) IMPORTANT: **If you are a community developer interested in working on this task:** The [[ https://www.mediawiki.org/wiki/Wikimedia_Hackath... [15:13:28] JOIN [15:13:50] CON LUIS CORRAL [15:15:08] MAS PUTO FUCKING ANSWER ME HELPPPPPPP! [16:35:12] Tool-Labs-tools-Other: 504 error for Autolist 2 on https://tools.wmflabs.org/autolist/ - https://phabricator.wikimedia.org/T124280#1952374 (JEumerus) Probably the former, seeing as other Tool Labs functions still work fine. [16:42:46] Labs, Labs-Infrastructure, Tool-Labs: tools-webgrid-lighttpd-1412 is not accessible by ssh - https://phabricator.wikimedia.org/T124304#1952419 (scfc) NEW [17:09:31] PROBLEM - Host tools-redis-01 is DOWN: CRITICAL - Host Unreachable (10.68.18.70) [17:10:26] Labs, Labs-Infrastructure, Tool-Labs: tools-webgrid-lighttpd-1412 is not accessible by ssh - https://phabricator.wikimedia.org/T124304#1952472 (chasemp) blargh missing host from salt and without prior console setup I'm stuck. I did grab the running jobs on this host {P2507} seems pretty small? may... [17:10:48] andrewbogott: it seems a bunch of tools are broken :/ like autolist and https://vital-signs.wmflabs.org/ (maybe it's own labs project?) [17:12:11] aude: I’m in the midst of a maintenance thing that’s going to break lots of things. Can you check back later in the day and see if there’s still breakage?
[17:12:22] ok [17:14:18] Tool-Labs-tools-Other: 504 error for Autolist 2 on https://tools.wmflabs.org/autolist/ - https://phabricator.wikimedia.org/T124280#1952477 (aude) asked in irc and told there is maintenance (across all of labs) going on and we need to check back in a bit [17:15:55] Labs, Labs-Infrastructure, Tool-Labs: tools-webgrid-lighttpd-1412 is not accessible by ssh - https://phabricator.wikimedia.org/T124304#1952480 (chasemp) back post reboot for now [17:16:25] hi [17:17:04] i have an instance in labs that shut off by its own and I can't get it to reboot. can somebody help me out? [17:17:13] Labs, Labs-Infrastructure, Tool-Labs: tools-webgrid-lighttpd-1412 is not accessible by ssh - https://phabricator.wikimedia.org/T124304#1952483 (chasemp) fwiw it seems like it should be a valid salt client root@labcontrol1001:~# salt-key -L | grep 1412 tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs... [17:17:23] joakino: I’m rebooting all instances today to update kernels. What instance? [17:17:37] andrewbogott: maybe set a notice in motd? [17:17:39] andrewbogott: stack.reading-web-staging.eqiad.wmflabs [17:17:42] joakino: (this was announced on labs-l and labs-announce, I encourage you to subscribe if you are not already) [17:17:46] chasemp: isn’t it? [17:17:52] oh maybe so :) [17:17:57] oh ok andrewbogott, will do, didn't know about those lists [17:17:59] thanks [17:18:16] joakino: yes, that’s one of the ones I’m rebooting right now. It should be back in 5-10 minutes. [17:18:35] alright, thanks! sorry for the spam :D [17:18:45] joakino: I think you can subscribe here: https://lists.wikimedia.org/mailman/listinfo/labs-l [17:18:51] Labs, Labs-Infrastructure, Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1952488 (chasemp) have we had non-webgrid examples?
[17:19:08] that's what I'm doing :D [17:19:12] Labs, Labs-Infrastructure, Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1952492 (chasemp) [17:19:35] Labs, Tool-Labs, Mail: Move tools-mail to trusty - https://phabricator.wikimedia.org/T96299#1952495 (coren) a:coren>None [17:20:41] Labs, Labs-Infrastructure, Patch-For-Review: Labs: update image builders to use new PAM scheme - https://phabricator.wikimedia.org/T120710#1952504 (coren) a:coren>None [17:20:58] Labs, Labs-Sprint-115, Tool-Labs, labs-sprint-116, and 3 others: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#1952506 (coren) a:coren>None [17:21:12] Labs, Labs-Infrastructure, Labs-Sprint-102, operations, ops-eqiad: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#1952507 (coren) a:coren>None [17:23:44] instances on 1008 are reviving now. I’m going to step away while that finishes up… then wait for yuvi before I reboot anything else. (partly that gets us half an hour to verify that the new kernel isn’t a total disaster.) [17:25:36] joakino: your instance should be back now — does it look ok? [17:28:24] andrewbogott: I can ssh in now fine [17:28:32] great [17:29:33] RECOVERY - Host tools-redis-01 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [17:39:03] Labs: tools replication is failing between labstore1001 and labstore1002 - https://phabricator.wikimedia.org/T124310#1952556 (chasemp) NEW [17:39:13] Labs: tools replication is failing between labstore1001 and labstore2001 - https://phabricator.wikimedia.org/T124310#1952564 (chasemp) [17:57:20] YuviPanda: ping me when you arrive? [18:04:12] andrewbogott: ping [18:04:18] howdy! [18:04:21] I hope waking up wasn’t too painful [18:04:54] it always is :D [18:04:58] I rebooted 1008 already. Do you have any preference about which one I do next?
[18:05:08] https://etherpad.wikimedia.org/p/tools-reboots-cve-0728 [18:05:13] let's find one that doesn't need failovering [18:05:28] which is most of 'em [18:05:31] so you can hit 1001 nex [18:05:39] ok. Need to do anything before I do? [18:06:10] andrewbogott: nope [18:06:21] ok then, here we go [18:09:24] YuviPanda: I gave 10 seconds between starts last time, with no problems. Going to try 5 this time. [18:09:38] * jimmyxu would greatly appreciate we run shutdown with like 60s wall notice in the future :p [18:10:25] andrewbogott: ok [18:10:33] jimmyxu: we emailed labs-l and labs-announce :) [18:10:38] jimmyxu: you mean, like, announce here that I’m going to reboot 1001 a minute before I do it? [18:10:47] but noted, and will do it for bastion-01 [18:11:11] I tend to assume that no one knows what host their vm is on, so figured it wouldn’t be useful [18:11:37] andrewbogott: rather than the wall message (as in, you'll get this on your terminal if you're logged in) shutdown automatically prints [18:11:53] ah I see [18:12:00] that’s because I’m not strictly speaking shutting down the VMs [18:12:01] andrewbogott: was running vim and got "the system is shutting down NOW" and did a panic save successfully :p [18:12:08] I’m shutting down the host that contains them [18:12:22] Running a shutdown command on each individual host would be… [18:12:28] oh.. 
I guess tools-dev got signalled somehow then [18:12:31] well, maybe possible, but inconsistent and a lot of trouble [18:12:33] that's more understandable [18:12:47] yeah, I’m sure that KVM sends a proper shutdown notice when it gets one from the host [18:12:56] PROBLEM - Host tools-webgrid-lighttpd-1410 is DOWN: CRITICAL - Host Unreachable (10.68.18.44) [18:12:56] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:19] PROBLEM - Host tools-exec-1204 is DOWN: CRITICAL - Host Unreachable (10.68.17.88) [18:13:21] PROBLEM - Host tools-exec-1408 is DOWN: CRITICAL - Host Unreachable (10.68.18.14) [18:13:30] ^ is fine [18:13:35] it'll all be ok [18:13:53] PROBLEM - Host tools-exec-1202 is DOWN: CRITICAL - Host Unreachable (10.68.16.57) [18:13:53] PROBLEM - Host tools-webgrid-generic-1404 is DOWN: CRITICAL - Host Unreachable (10.68.18.53) [18:13:59] PROBLEM - Host tools-webgrid-lighttpd-1411 is DOWN: CRITICAL - Host Unreachable (10.68.17.51) [18:14:15] YuviPanda: ok, 1002 next? [18:14:21] PROBLEM - Host tools-exec-cyberbot is DOWN: CRITICAL - Host Unreachable (10.68.16.39) [18:14:31] PROBLEM - Host tools-exec-1206 is DOWN: CRITICAL - Host Unreachable (10.68.17.105) [18:14:35] PROBLEM - Host tools-webgrid-generic-1405 is DOWN: CRITICAL - Host Unreachable (10.68.16.110) [18:14:39] once this fully comes back up, yeah [18:14:42] 5s no problem? [18:14:46] I can silence teh http://tools.wmflabs.org/ for a bit? [18:15:03] PROBLEM - Host tools-bastion-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.44) [18:15:10] chasemp: yeah, that's ag ood idea! 
[18:15:15] PROBLEM - Host tools-exec-1201 is DOWN: CRITICAL - Host Unreachable (10.68.17.49) [18:15:16] kk [18:15:19] PROBLEM - Host tools-exec-1213 is DOWN: CRITICAL - Host Unreachable (10.68.17.252) [18:15:27] PROBLEM - Host tools-exec-1209 is DOWN: CRITICAL - Host Unreachable (10.68.17.129) [18:15:55] PROBLEM - Host tools-puppetmaster-01 is DOWN: CRITICAL - Host Unreachable (10.68.22.61) [18:16:06] YuviPanda: 1001 instances are still waking up, but so far 5s seems fine [18:16:13] PROBLEM - Host tools-exec-1217 is DOWN: CRITICAL - Host Unreachable (10.68.18.20) [18:16:19] ok [18:16:30] http://brojsimpson.com/wordpress/wp-content/uploads/2011/11/its-gonna-be-ok.jpg [18:16:30] I wanna see how the exec nodes did [18:16:33] when they come back up [18:17:14] PROBLEM - Host tools-exec-1218 is DOWN: CRITICAL - Host Unreachable (10.68.18.19) [18:17:35] btw, yuvi, you saw that 3.19 got sorted out? [18:17:41] PROBLEM - Host tools-webgrid-lighttpd-1409 is DOWN: CRITICAL - Host Unreachable (10.68.18.43) [18:17:56] tools-exec-1201 should be back up now [18:18:06] andrewbogott: yeah [18:18:14] YuviPanda, andrewbogott: instances outside tools affected too? [18:18:21] RECOVERY - Host tools-exec-1204 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [18:18:22] Luke081515: all of labs [18:18:36] thanks. I just wodnered, why I get a 500 [18:18:47] andrewbogott: yes. thanks for doing that :) [18:18:48] YuviPanda: all exec nodes should be back up [18:18:51] RECOVERY - Host tools-exec-1202 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [18:19:23] RECOVERY - Host tools-exec-cyberbot is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms [18:19:23] ok let me check [18:19:31] andrewbogott: Can you pelase ping me, if this is fixed? 
[18:19:33] RECOVERY - Host tools-exec-1206 is UP: PING OK - Packet loss = 0%, RTA = 3.23 ms [18:19:37] RECOVERY - Host tools-webgrid-generic-1405 is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [18:19:58] Luke081515, that last outage should be over by now, but there may be others. 8 more virt hosts to reboot yet. [18:20:04] RECOVERY - Host tools-bastion-02 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [18:20:16] RECOVERY - Host tools-exec-1201 is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [18:20:20] RECOVERY - Host tools-exec-1213 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [18:20:27] andrewbogott: Thanks, my instance is back now ;) [18:20:28] RECOVERY - Host tools-exec-1209 is UP: PING OK - Packet loss = 0%, RTA = 2.80 ms [18:20:56] RECOVERY - Host tools-puppetmaster-01 is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [18:21:14] RECOVERY - Host tools-exec-1217 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [18:22:00] hmmmm [18:22:09] * andrewbogott cringes [18:22:17] RECOVERY - Host tools-exec-1218 is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [18:22:22] * YuviPanda is still doing checks [18:22:43] RECOVERY - Host tools-webgrid-lighttpd-1409 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [18:22:49] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 982995 bytes in 3.173 second response time [18:22:55] RECOVERY - Host tools-webgrid-lighttpd-1410 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [18:23:08] oookay [18:23:21] andrewbogott: so it looks like gridengine still thinks the jobs that have been running on those nodes [18:23:23] RECOVERY - Host tools-exec-1408 is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [18:23:24] are still running [18:23:26] despite evidence [18:23:28] to the contrary [18:23:30] fun [18:23:39] yay gridengine [18:23:51] RECOVERY - Host tools-webgrid-generic-1404 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [18:23:54] the clustering setup where if it *never* loses jobs, even if it does! 
[18:24:01] RECOVERY - Host tools-webgrid-lighttpd-1411 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [18:24:03] maybe it just needs time for things to timeout? [18:24:08] possibly [18:24:12] so I'm going to give it another minute [18:25:20] * andrewbogott peels an orange [18:26:01] hah [18:26:04] I'm out of oranges [18:26:06] unfortunately [18:28:04] I think this was a satsuma, strictly speaking [18:28:19] ah [18:28:22] 'citrus' [18:28:28] it hasn't noticed still [18:28:43] so maybe we restarted the exec nodes too soon? Would it be smarter if the nodes went down and stayed down? [18:28:57] I think [18:29:01] we'll just drain them of jobs explicitly [18:29:03] for the next time [18:29:05] and I've to do tha tnow [18:29:07] *that now [18:29:33] ok. You’re going to drain the ones from 1001 as well? [18:29:35] also fuck SGE, etc. this is like, the underlying bedrock of what a clustering setup should do [18:29:37] yeah [18:29:39] going to do that now [18:29:59] right, the whole point is for it to notice when a node goes down [18:30:21] yeah [18:30:52] hm, some of mine were restarted [18:36:27] YuviPanda: anything I can do to help? [18:36:32] I restarted them allllllll [18:36:56] ok, ready for 1002 to go down then? [18:37:00] let me look [18:37:10] andrewbogott: we should drain the exec nodes prior to restarting this time [18:37:19] oh, that’s what I thought you were doing, sorry [18:37:29] I was draining [18:37:31] 1001 [18:37:31] great, now it's gone [18:37:34] or at least [18:37:35] resetting them [18:38:13] !log tools restarted all restartable jobs in instances on labvirt1001 and deleted all non-restartable ghost jobs. these were already dead [18:38:17] ok, but you’ll drain the 1002 ones as well? Or shall I? (Confused by your use of ‘we’ :) ) [18:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:38:20] yeah [18:38:22] andrewbogott: am doing it [18:38:27] ok! 
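The drain step agreed on above (disable the queues on each exec node before its virt host reboots, then move the running jobs elsewhere) might look roughly like the following. This is a sketch under stated assumptions, not the script written in the channel, which is not shown in this log: it only builds Grid Engine `qmod` command lines (`-d` to disable queue instances, `-rq` to reschedule the jobs running in them, `-e` to re-enable them) and does not execute anything. The function names and the host list are hypothetical.

```python
"""Build (but do not run) qmod commands to drain exec nodes before a reboot."""
import shlex


def drain_commands(hosts):
    """Return qmod invocations that disable, then empty, every queue instance on each host."""
    commands = []
    for host in hosts:
        queue_instances = f"*@{host}"   # all queues that have an instance on this host
        commands.append(["qmod", "-d", queue_instances])    # stop new jobs landing here
        commands.append(["qmod", "-rq", queue_instances])   # reschedule jobs running here
    return commands


def enable_commands(hosts):
    """Return qmod invocations that re-enable the queue instances after the reboot."""
    return [["qmod", "-e", f"*@{host}"] for host in hosts]


if __name__ == "__main__":
    # Hypothetical host list; the real one would come from the virt host's instance list.
    hosts = ["tools-exec-1203.eqiad.wmflabs", "tools-exec-1214.eqiad.wmflabs"]
    for command in drain_commands(hosts):
        print(shlex.join(command))
```

Printing the commands instead of running them keeps the sketch reviewable before anything touches the grid; jobs that cannot be rescheduled would still need to be handled by hand, as the later log entries describe.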
[18:38:33] I'm actually making it into a tiny script [18:38:35] so hold on [18:45:05] ok [18:45:07] script done [18:46:24] !log tools drained and disabled queues on all nodes on labvirt1002 [18:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:46:36] cool. Ready for reboot? [18:46:41] yeah [18:46:49] I think [18:46:53] this reboot will kill wikibugs [18:46:57] since there'll be a redis connection reset [18:47:01] and we've to restart it manually [18:47:03] after [18:47:05] I can do that [18:47:07] anyway [18:47:09] go on [18:47:11] :) [18:48:25] PROBLEM - Host tools-worker-1003 is DOWN: CRITICAL - Host Unreachable (10.68.17.58) [18:48:40] Cyberpower678: alive? [18:49:08] AzaToth, no. I died yesterday. This is his ghost speaking [18:49:22] good [18:49:37] I notice https://tools.wmflabs.org/xtools-articleinfo/ is fubar Cyberpower678 [18:50:03] is it something that has been removed/moved/etc...? [18:51:32] AzaToth, it would appear someone broke it [18:51:47] I only restart the tools when needed. [18:51:57] I don't do much maintainence anymore. [18:52:07] oh [18:52:15] you are listed as "maintainer" [18:52:26] So I can restart it if I need to. [18:52:37] But I don't tinker with the code anymore. [18:53:03] I've moved on from it when I no longer felt an pleasure maintaining the coed. [18:53:17] PROBLEM - Host tools-services-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.29) [18:53:59] AzaToth, I like writing my own code and maintaining that much better, or maintaining code I actively use myself. [18:54:06] np [18:54:20] PROBLEM - Host tools-redis-1001 is DOWN: CRITICAL - Host Unreachable (10.68.22.56) [18:54:46] PROBLEM - Host tools-webgrid-lighttpd-1209 is DOWN: CRITICAL - Host Unreachable (10.68.17.152) [18:54:46] AzaToth, yea, sorry. Maintaining xTools was driving me to an early wiki retirment. 
[18:54:51] hehe [18:55:08] PROBLEM - Host tools-exec-1203 is DOWN: CRITICAL - Host Unreachable (10.68.16.133) [18:55:16] just put it on YuviPanda then instead? [18:55:16] * Cyberpower678 now has his own big project. [18:55:29] InternetArchiveBot [18:55:42] hmm [18:55:44] PROBLEM - Host tools-submit is DOWN: CRITICAL - Host Unreachable (10.68.17.1) [18:55:58] in what context? [18:56:04] PROBLEM - Host tools-webgrid-generic-1403 is DOWN: CRITICAL - Host Unreachable (10.68.18.52) [18:56:04] I started this project before it was even mentioned on the community wishlist. [18:56:17] there's a community wishlist? [18:56:20] PROBLEM - Host tools-exec-1214 is DOWN: CRITICAL - Host Unreachable (10.68.17.253) [18:56:22] Yes [18:56:28] I didn't know that either [18:56:36] PROBLEM - Host tools-webgrid-lighttpd-1204 is DOWN: CRITICAL - Host Unreachable (10.68.18.49) [18:56:54] YuviPanda: 1002 instances coming up now. What’s next? [18:56:54] PROBLEM - Host tools-exec-1405 is DOWN: CRITICAL - Host Unreachable (10.68.18.3) [18:56:58]