[00:01:12] 6Labs, 10Tool-Labs: Migrate individual tools to trusty to relieve pressure on older precise nodes - https://phabricator.wikimedia.org/T88228#1530036 (10Ricordisamoa) We can at least 'strongly encourage' developers to trustize their tools, can't we?
[00:28:37] 6Labs, 10Tool-Labs: Migrate individual tools to trusty to relieve pressure on older precise nodes - https://phabricator.wikimedia.org/T88228#1530093 (10scfc) No, I cannot. As I pointed out pushing developers now to move to Trusty only to have to move to Jessie in the already foreseeable future would create un...
[01:37:37] is it known issue that VE doesn't work on wikitech?
[01:44:34] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: visual editor on wikitech fails - https://phabricator.wikimedia.org/T108776#1530322 (10Dzahn) 3NEW
[01:45:49] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: visual editor on wikitech fails - https://phabricator.wikimedia.org/T108776#1530332 (10Dzahn) 18:41 < gwicke> I think that has been broken since the last bigger VE deploy 18:43 < gwicke> James_F, is there a task for the wikitech issue?
[01:46:09] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: visual editor on wikitech fails - https://phabricator.wikimedia.org/T108776#1530333 (10Dzahn)
[01:51:56] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: visual editor on wikitech fails - https://phabricator.wikimedia.org/T108776#1530336 (10Jdforrester-WMF) Meh.
[01:52:53] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor on wikitech fails - https://phabricator.wikimedia.org/T108776#1530337 (10Jdforrester-WMF) a:3Krenair
[03:45:22] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor on wikitech fails - https://phabricator.wikimedia.org/T108776#1530469 (10Krenair) ```error: Config Request failure for "https://wikitech.wikimedia.org/w/api.php": 404 path: /labswiki/Main_Page Error: Config Request failure for "https:...
[03:48:28] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor on wikitech fails - https://phabricator.wikimedia.org/T108776#1530481 (10Krenair) If that's Parsoid trying to hit Varnish, see T102178
[03:50:13] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor on wikitech fails - https://phabricator.wikimedia.org/T108776#1530492 (10Krenair) You can reproduce that error easily (without manually editing the VE error handling code live on silver) by browsing to http://parsoid-lb.eqiad.wikimedi...
[04:03:21] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1530494 (10Krenair) a:5Krenair>3GWicke
[04:07:05] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1530497 (10GWicke) This looks like a Parsoid <-> wikitech API communication issue to me. Moving on to Parsoid.
[04:07:39] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1530498 (10GWicke) a:5GWicke>3Arlolra
[04:09:38] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1530501 (10Krenair) Might be due to `lib/mediawiki.ParsoidConfig.js`'s check for private/fishbowl wikis, which would turn off use of the default proxy. But since https://gerrit....
[04:12:07] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1530502 (10Krenair) https://gerrit.wikimedia.org/r/#/c/228024/1/lib/sitematrix.json - seems like Parsoid has it's own local version of sitematrix and wikitech was updated to no...
[04:16:37] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1530508 (10GWicke) @Krenair, nice research!
[09:46:18] 10Tool-Labs-tools-Erwin's-tools, 5Patch-For-Review: Migrate https://toolserver.org/~erwin85/talkcatintersect.php to Tool Labs - https://phabricator.wikimedia.org/T62874#1531065 (10Magnus) Note that you can replicate this functionality by subsetting two result sets from [[ https://tools.wmflabs.org/quick-inters...
[10:39:29] 10Tool-Labs-tools-Erwin's-tools: "Fatal Error: Database query failed" for Related Changes tool - MySQL down? - https://phabricator.wikimedia.org/T104688#1531184 (10Supernino) 5Open>3Resolved a:3Supernino Now it's working again; don't know if @Erwin (Erwin) or others did something about it. I'll re-open it...
[11:36:34] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Vikassy was created, changed by Vikassy link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Vikassy edit summary: Created page with "{{Tools Access Request |Justification=We plan to run OCR like tesseract for converting a full book(which takes time) in various Indic languages. In future we plan to create a..."
[11:38:19] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Vikassy was modified, changed by Merlijn van Deen link https://wikitech.wikimedia.org/w/index.php?diff=173777 edit summary:
[13:20:01] 6Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1531487 (10coren)
[13:20:01] 6Labs, 3Labs-Sprint-107, 3Labs-Sprint-108, 5Patch-For-Review: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1531485 (10coren) 5Open>3Resolved The backups, they are run.
[13:43:55] (03PS1) 10Ejegg: Filter more WMDE projects from #wm-fundraising [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/231002
[13:47:36] (03CR) 10Merlijn van Deen: [C: 04-1] "I think this was on purpose, as #wikimedia-dev is supposed to have all mediawiki related messages, including all extensions."
[labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/230945 (owner: 10Jforrester)
[13:57:58] Coren, ping
[13:58:37] Cyberpower678: Pong. Didn't forget you. We've got new hardware on order soon which will give us a bit more breathing room for resources.
[13:58:54] Coren, 32GB? :DD
[13:59:24] Possibly.
[13:59:27] Yay.
[13:59:51] Coren, out of curiosity when is soon?
[14:00:52] That's... always a bit random because it depends on a lot of things. Weeks, not months.
[14:01:02] :-)
[14:15:05] (03CR) 10Merlijn van Deen: [C: 04-1] "Would it be an option to explicitly list the projects that /should/ be matched? According to https://www.mediawiki.org/w/index.php?title=P" [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/231002 (owner: 10Ejegg)
[14:33:12] 6Labs: Support bare-metal server allocation in labs - https://phabricator.wikimedia.org/T95185#1531693 (10Dzahn) the original request says: " would save a lot of time dickering over hardware allocation" one of the follow-ups is immediately 'How will allocation work? Do we have a pool that rotates?" that may be...
[14:41:10] !log tools forcing reschedule of jobs on tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410
[14:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[14:48:54] valhallasw`cloud: I just ran your suggested qmod commands but I still see jobs as running on those exec nodes. Can you advise?
[14:50:14] andrewbogott: mmm
[14:51:24] valhallasw`cloud: this is what I’m running (with all node names as args) https://dpaste.de/8EML
[14:51:42] should the second step just be qdel instead?
[14:51:47] oh, yes
[14:52:01] that probably gave an error?
I'm not sure what happens if you qmod -rj an existing task
[14:52:30] The job 1718497 is running in queue task@tools-exec-1210.eqiad.wmflabs where jobs are not rerunable.
[14:52:54] ok, so https://dpaste.de/yjcx
[14:52:56] ?
[14:53:06] yes, although that won't work for the webgrid nodes
[14:53:49] I would first qdel on all hosts, then do a second run to restart everything else, so... https://dpaste.de/VbRm#L4,5
[14:53:57] ehh
[14:54:10] https://dpaste.de/40Bp#L4,5
[14:54:51] we want to do the del before the mod?
[14:55:01] I thought the goal was reschedule and then kill what can’t be rescheduled...
[14:56:00] andrewbogott: the 'grep' filters for tasks in the task queu
[14:56:01] e
[14:56:06] ah, ok.
[14:56:11] OK, so here I go, running your script...
[14:56:38] hm, I still see jobs
[14:58:15] oh, missing a ‘do’ :)
[14:58:58] well… hm, that seemed to do more but runningtasks.py still returns a lot
[14:59:27] 12 jobs...
[14:59:46] and none in state 'd' (except for the one I did manually just now)
[15:01:13] not sure why runningtasks lists a different set :|
[15:01:13] no, it's the same set
[15:01:13] ok...
[15:01:54] andrewbogott: manually ran sudo qdel $(qhost -j -h $HOSTS | grep task | sed -e 's/^\s*//' | cut -d ' ' -f 1|egrep ^[0-9])
[15:02:11] qhost -j -h $HOSTS now shows all of them in 'd' state
[15:02:19] hm...
[15:02:26] is that because there’s a typo in the script or because it just took a few minutes?
[15:03:23] valhallasw`cloud: anyway, ready for me to reboot, yes?
[15:11:47] andrewbogott: oh, you're already restarting.
[15:12:05] valhallasw`cloud: yep, waiting for labvirt1001 to resurface
[15:12:13] it takes these boxes forever to post
[15:12:35] But I still care about the accuracy of that script, for the sake of future reboots...
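The job-ID extraction that valhallasw`cloud pastes above (`qhost -j -h $HOSTS | grep task | sed ... | cut ... | egrep ^[0-9]`) can be sketched as a small shell function. This is not the actual killjobs.sh from the conversation; the sample `qhost -j` output format in the comment is an assumption, and only the text-processing pipeline is taken verbatim from the log:

```shell
#!/bin/sh
# extract_task_job_ids: read `qhost -j` output on stdin and print the
# numeric IDs of jobs scheduled in the 'task' queue. Mirrors the pipeline
# pasted in the channel: grep filters for task-queue lines, sed strips
# leading whitespace, cut takes the first field, egrep keeps only lines
# that start with a digit (i.e. real job IDs, not headers).
extract_task_job_ids() {
    grep task | sed -e 's/^\s*//' | cut -d ' ' -f 1 | egrep '^[0-9]'
}

# Hypothetical use during a reboot drain (task-queue jobs are not
# rerunable, so they are qdel'd; everything else can be qmod -rj'd):
#   sudo qdel $(qhost -j -h "$HOSTS" | extract_task_job_ids)
```

Killing the non-reschedulable task jobs first, then force-rescheduling the rest, matches the "qdel before qmod" ordering the two settle on above.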
[15:13:47] ok, tools instances coming back up now
[15:17:46] ok, cyberbot is back, let me enable the queue
[15:18:28] cannot run on host "tools-exec-cyberbot.eqiad.wmflabs" until clean up of an previous run has finished
[15:18:29] gaaah
[15:18:50] andrewbogott: ok, so killing jobs before restarting definitely is a good idea (TM)
[15:19:25] which we didn’t I guess because I didn’t include cyberbot in my arg list?
[15:19:35] *nod*
[15:19:42] oh, but the cleanup is reasonably fast
[15:19:55] all cyberbot stuff is working again
[15:20:04] ok, thanks!
[15:20:35] want to look at /home/andrew/killjobs.sh and see if it needs alteration?
[15:20:55] !log tools re-enabling queues on restarted hosts
[15:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[15:28:45] andrewbogott: it can be made a bit simpler; qhost allows multiple hosts
[15:28:48] let me hack a bit...
[15:29:36] ‘k
[15:29:52] andrewbogott: see /home/valhallasw/restarttools/killjobs.sh
[15:30:33] andrewbogott: do you have a set of the ones that we need to disable now?
[15:30:56] Will shortly
[15:31:13] oh, it's on https://wikitech.wikimedia.org/wiki/Virt_node_upgrade_schedule of course
[15:31:24] yeah, presuming no new nodes have been created
[15:31:39] not sure about tools-services and tools-web-static-02...
[15:31:59] yuviiiiiiiiiii
[15:35:18] oh, we need to repool the nodes from 1001 as well. Do you want to do that or shall I?
[15:35:34] just did that :-)
[15:35:45] (assuming you mean re-enabling the queues)
[15:36:51] wait, tools-master and tools-shadow are on the same virt host...
[15:40:22] huh, that seems bad
[15:42:38] It's an issue, but not a desperately critical one if we keep the downtime low - it prevents scheduling but does not disrupt anything running.
[15:43:26] do we need any manual intervention on the hosts?
[15:44:04] Not if they're both going down, there's nothing to be done.
[15:44:50] But we should move -shadow away from that host for the future.
[15:45:11] Actually, we could avoid disruption entirely if we moved the shadow to a virt host that has already been rebooted.
[15:45:31] (before we shut down -master, obviously)
[15:45:47] Sounds like a good idea.
[15:45:58] the reboot is scheduled for friday 1500 UTC
[15:46:29] Transferring instances between hosts is currently hard, and (in theory) after the reboots it will be come easy
[15:46:44] So — if it’s safe to postpone the rebalancing… best to postpone.
[15:47:05] If today is a good predictor for the downtime, that should be OK (but something to note in the email)
[15:48:06] yeah, today should be typical
[15:48:38] It's safe, just a bit more disruptive. I'm okay with postponing it if you think it better.
[15:49:13] That said, is the difficulty with migrating instance in the destination or the source?
[15:49:23] hm...
[15:49:28] destination I think
[15:49:48] although I’m touchy about it generally since I don’t know which steps involve suspend
[15:50:19] So when I send the alert email for Friday I should just include ‘btw you won’t be able to schedule new jobs for a few minutes'
[15:50:29] I think so, yes.
[15:52:01] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org, 5Patch-For-Review: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1531891 (10ssastry) >>! In T108776#1530502, @Krenair wrote: > https://gerrit.wikimedia.org/r/#/c/228024/1/lib/sitematrix.json - seems like Parsoid has it's...
[15:53:53] * valhallasw`cloud eyes grrrrit-wm
[16:05:19] !log tools depooling tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1408
[16:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[17:14:04] valhallasw`cloud: did you move /home/valhallasw/mailer/runningtasks.py ?
[17:18:04] nm, found it
[17:18:30] Sorry, yea
[17:20:52] Coren: I would appreciate a review of https://gerrit.wikimedia.org/r/#/c/229458/ so that I can archive things and close out that bug.
[17:21:35] Looking at it now.
[17:45:23] (03CR) 10Jforrester: "> #wikimedia-dev is supposed to have all mediawiki related messages, including all extensions." [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/230945 (owner: 10Jforrester)
[18:14:31] (03CR) 10Legoktm: [C: 04-2] "This was an intentional decision when #wikimedia-corefeatures was set up (https://static-bugzilla.wikimedia.org/54833) and AFAIK that hasn" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/230945 (owner: 10Jforrester)
[18:56:54] labs-morebots: still there?
[18:56:55] I am a logbot running on tools-exec-1210.
[18:56:55] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[18:56:55] To log a message, type !log .
[19:19:14] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org, 5Patch-For-Review: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1532353 (10Arlolra) Should the sitematrix mark it as nonglobal? https://gerrit.wikimedia.org/r/231086
[19:39:23] 6Labs, 10CirrusSearch, 6Discovery, 10Labs-Infrastructure: Make available an XL labs instance with ~350GB available disk space.
- https://phabricator.wikimedia.org/T108767#1532435 (10EBernhardson)
[19:43:13] 6Labs, 6Multimedia, 6operations, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1532461 (10Krenair) https://wikitech.wikimedia.org/wiki/File:Wikimedia_labs_logo.svg I was trying to put this on https://wikitech.wikimedia...
[20:00:26] 6Labs, 6Multimedia, 6operations, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1532495 (10Krenair) It looks like https://wikitech.wikimedia.org/w/images/thumb/6/60/Wikimedia_labs_logo.svg/600px-Wikimedia_labs_logo.svg....
[20:02:40] !log tools.morebots valhallasw: Deployed 920d2c17c0856a3a5525d0d4ff1a6cc01d210cd6 Fix signature in changelog
[20:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.morebots/SAL, Master
[20:02:47] ok, that works
[20:02:55] except that it probably won't when it's actually being restarted
[20:02:58] dumdumdum
[20:06:34] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org, 5Patch-For-Review: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1530322 (10Jdforrester-WMF)
[20:42:32] hahah
[20:45:00] !log deployment-prep restarted restbase on deployment-restbase01 (dead)
[20:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master
[21:47:38] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org, 5Patch-For-Review: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1532852 (10Krenair) 5Open>3Resolved
[22:54:39] Coren: can you do a ‘down-for-everyone-or-just-me’ test and ssh into a labs instance or two?
[22:55:22] andrewbogott: I'm getting connection refused messages from things behind the hhtp proxy for labs
[22:55:32] bd808: ok
[22:56:00] I an traceroute and ping but no port 443
[22:56:30] I’m seeing that kind of thing too.
I may have a fix
[22:56:49] …just as soon as jenkins permits it
[22:57:20] *nod* I can ssh to several semi-random instances
[22:59:59] wait you /can/ ssh?
[23:00:12] I’m… further confused
[23:01:46] Not to all hosts apparently. ok for shaved-yak.eqiad.wmflabs, stashbot-deploy.stashbot.eqiad.wmflabs. Failed for ieg-dev.eqiad.wmflabs
[23:02:23] ok, that’s less crazy then
[23:02:36] bd808: ssh works for some tho?
[23:03:03] it did. now I'm getting key change for bastion.wmflabs.org
[23:03:13] key change!
[23:03:15] I got that too
[23:03:28] The fingerprint for the ECDSA key sent by the remote host is SHA256:qvsadWlkEWNfGXRKdnRT+uu/MGaT8GBFF1J033QFWmg.
[23:03:34] why would a change in routing cause /that/ to change?
[23:04:33] andrewbogott: if that setting is the metadata service registry?
[23:04:53] maybe instead of ec2_dmz_host
[23:04:58] but wut
[23:05:06] why would that affect ssh?
[23:05:19] And… now I’ve reverted everything but still can’t ssh in anyplace
[23:05:21] would it affect keys or ? idk thsi seems unrelated
[23:05:30] maybe red herring timing
[23:07:00] if you have hte proxy command and the proxy host key changes, does that bubble up or does it just do weird stuff. Becaue I am getting 'WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!' for bastion-restricted.wmflabs.org
[23:07:03] directly
[23:07:08] yeap me too
[23:07:11] was just about to ask!
[23:07:14] accepted the new bastion key; not getting Permission denied (publickey). from hosts I got to just a few minutes ago
[23:07:30] I get connection refused by web too
[23:07:33] I'm getting it for tools-login.wmflabs.org
[23:07:47] about remote host identification changes
[23:08:04] chasemp: a change in the proxy host key will alert. That's what I was getting
[23:08:42] confirmed. ECDSA host key for bastion-restricted-eqiad.wmflabs.org has changed
[23:08:47] ok, but tools-login /is/ a proxy
[23:08:52] why would it have changed?
[23:08:55] should this work?
[23:08:56] rush@labnet1001>ping deployment-elastic08.eqiad.wmflabs
[23:09:10] chasemp: I wouldn’t necessarily expect that to work
[23:09:22] because Host deployment-elastic08.eqiad.wmflabs not found: 3(NXDOMAIN)
[23:09:31] I can ping it from within a labs instance
[23:09:44] Did something magical happen to all our ssl installs, unrelated to changing that network setting?
[23:10:05] what's the network setting?
[23:10:22] network_host in nova.conf
[23:10:23] mutante: https://gerrit.wikimedia.org/r/#/c/231177/
[23:10:26] which is now reverted, anyway
[23:10:49] I’m getting recovery messages from labs instances. Which means that their metadata services are now fine
[23:11:00] but, my ssh keys are now rejected everywhere. Including my root key
[23:11:30] we are getting new alerts in operations
[23:11:37] about nova-compute processes all being stipped
[23:11:39] stopped
[23:12:14] hm, ok...
[23:13:05] andrewbogott: can you login on labvirt* ?>
[23:13:13] yes
[23:13:22] I just restarted all the nova-compute services
[23:13:24] they seem ok now
[23:13:24] I'm puppet apply'ing on labvirt1001 now
[23:14:13] and we have a recovery
[23:14:18] ok, but how would any of this cause ssh failures?
[23:14:33] I mean, unable-to-read-key is different from key-mismatch isn’t it?
[23:15:46] all I can figure is vm or node tries to call home to metadata service and fails then it takes some action like "ok I better wipe out everything I know here"
[23:15:53] which is madness I guess?
[23:16:10] yeah, I can’t imagine why it would do that. Even if it did, puppet should restore the ssh keys immediately.
[23:16:13] Even if the host keys were lost
[23:17:07] I'm now getting "Permission denied (publickey)." from all of my test vms
[23:17:31] why would nova-compute die on all nodes at once?
[23:17:36] the only thing interesting in the verbose output is "debug1: key_load_public: No such file or directory"
[23:17:58] so remote side wiped out your key maybe
[23:18:52] mutante: I think it died because it couldn’t contact rabbitmq for a moment
[23:19:40] chasemp: apparently, but on 9 hosts in 4 projects :/
[23:19:48] mutante: I suspect that’s a red herring
[23:20:16] Is the ECDSA key fingerprint for tools-login.wmflabs.org equals to 2b:f7:41:f9:55:8a:73:7c:bc:83:55:42:9e:af:49:ad ?
[23:20:30] chasemp: oh... looks like maybe just on bastion.wmflabs.org maybe
[23:20:41] not getting past the proxy hop
[23:20:43] so, if apt automatically patched openssh everwhere, would that leave a trace in a log someplace?
[23:20:56] andrewbogott: /var/log/apt/history.log
[23:20:59] note that bastion-restricted is broken in the same way
[23:21:10] mutante: that’s what I thought… it’s empty on the one box I have access to.
[23:21:24] completely empty is weird
[23:21:40] 2015-08-12 23:12:57.946 50327 INFO nova.openstack.common.service [-] Caught SIGTERM, exiting on labvirt1001
[23:22:22] 2015-08-12 23:14:02.874 50634 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 190, in _raise_timeout_exception
[23:22:43] rabbitmq disappears for a minute, compute times out and loses it's mind and kills the service?
[23:23:41] services are dying all over the place
[23:23:48] nova-network just died after being up for a while
[23:23:56] again on labvirt1009 yeah
[23:23:58] I restarted rabbit, we’ll see if that results in more stsability
[23:24:35] but it seems to still be having problems with rabbitmq
[23:24:39] I'm slowing download a db dump, I hope download.wikimedia.org won't die meanwhile
[23:24:44] or at least it looksl ike that's what downed nova on labvirt1009
[23:24:58] what is 208.80.155.155
[23:24:59] bd808: is everything working now, by chance?
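The check mutante suggests above (did apt quietly upgrade openssh?) can be wrapped in a tiny helper; the path /var/log/apt/history.log is the one named in the conversation, while the helper name and the sample log format in the comment are hypothetical:

```shell
#!/bin/sh
# recent_apt_changes PKG LOGFILE: print apt history entries mentioning PKG,
# plus two lines of context (enough to catch the Upgrade:/End-Date: lines
# in apt's history format). Rotated logs live in history.log.*.gz and
# would need zgrep instead.
recent_apt_changes() {
    grep -A2 "$1" "$2" 2>/dev/null
}

# Typical use on a labs instance, per the conversation above:
#   recent_apt_changes openssh /var/log/apt/history.log
```

An empty result (as andrewbogott found) means apt did not record touching the package, which is what ruled out an automatic openssh upgrade here.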
[23:25:08] it has no reverse record
[23:25:11] I restarted on labvirt1009 andrewbogott btw
[23:25:16] chasemp: me too :)
[23:25:29] that is where the key comes from:
[23:25:33] @ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
[23:25:33] another fingerprint change for bastion
[23:25:48] the corresponding IP address 208.80.155.155
[23:25:52] same bd808
[23:26:01] mutante: 155.155 is bastion-restricted, isn’t it?
[23:26:06] bd808: I think it changed /back/
[23:26:11] by chance did it change back or new id all together?
[23:26:13] did you wipe out the keys when you got the first warning?
[23:26:18] yeah
[23:26:22] I did
[23:26:24] stupid of me
[23:26:28] so...
[23:26:29] I'm getting ssh to work now
[23:26:36] Here is my operating theory:
[23:26:43] - I pushed a puppet change that restarted some services
[23:26:55] - somethingsomething race conditions caused rabbitmq to freak out
[23:27:15] - services started randomly timing out due to losing track of rabbitmq
[23:27:17] (as goes rabbitmq so goes the world...)
[23:27:18] - including nova-network
[23:27:34] BUT, that does not account for why the particular symptoms we saw were insane
[23:27:56] it wasn't 2 minuts post rabbitmq restart that ssh stabilized and reverted (we think)
[23:28:04] * bd808 whispers "openstack" and drops mic
[23:28:07] and maybe rabbitmq never recovered post revert for network_host
[23:28:09] down the rabbit(mq) hole
[23:28:29] yeah
[23:29:21] so, I’m going to boldly step away for three to four minutes, and see if there aren’t a whole bunch of new alerts when I get back.
[23:29:27] 6Labs, 6Multimedia, 6operations, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1533300 (10Krenair) That looks much better now. So, PDFs: ```krenair@tin:/srv/mediawiki-staging (master)$ dpkg-query -S `which pdfinfo` pop...
[23:29:29] we need someone with the original bastion keys to confirm if this key is the same, out of curiosity
[23:29:34] I guess if it's not
[23:29:44] you'll hear about it from all the ppl wondering why teh new bastion key
[23:30:18] ok man! well I'm grabbing food let me know if I can help as I'll be around
[23:34:46] original bastion keys are on wikitech
[23:36:43] Can onlookers confirm that things seem pretty much ok now?
[23:37:23] things are working for me andrewbogott
[23:37:27] ok, thanks
[23:37:40] are you back on bastion-restricted?
[23:37:47] then please run: ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub
[23:38:12] mutante: why?
[23:38:13] works for me
[23:38:16] to get the fingerprint
[23:38:26] so wikitech can be updated that it's the right host
[23:39:45] unless we expect it to be back to the one before
[23:39:50] in that case, it isn't
[23:39:52] Pretty sure it never changed
[23:40:00] and instead a network issue caused misrouting
[23:40:09] for example, if all attempts to hit labs instances instead hit labnet1001
[23:40:12] then i cant confirm the issue is gone yet
[23:40:23] that would result in hostkey mismatch & ssh key failure
[23:40:43] fingerprint I saw was SHA256:s+xuLo91PcVIFcFdxPQC7IXgJ2nYxaXcqa7bKE7/ufA
[23:41:12] i still get d3:33:67:b9:14:08:0a:e6:e5:d9:fa:f9:4a:e7:3a:24
[23:41:34] running that command on the server tells us which one it has
[23:42:04] andrew@bastion-restricted-01:~$ ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub
[23:42:04] 2048 27:3a:f1:90:28:5e:dd:b3:04:83:10:57:3a:99:c1:d0 /etc/ssh/ssh_host_rsa_key.pub (RSA)
[23:42:23] thanks, so i am talking to another server
[23:42:39] mutante: ok, I’m totally lost
[23:42:40] Helder: no
[23:42:45] my questions are...
[23:42:54] - Where on wikitech are the old host keys? I can’t find them?
[23:43:03] - What do you mean 'i am talking to another server’?
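The verification step mutante walks andrewbogott through above (run `ssh-keygen -l` on the server's public host key and compare against the published fingerprint) can be sketched as follows. A throwaway key stands in for the real host key here; on an actual server you would point at /etc/ssh/ssh_host_rsa_key.pub, as in the log:

```shell
#!/bin/sh
# Generate a throwaway RSA key pair to stand in for a host key
# (assumption: this is only a demo; the real check runs against
# /etc/ssh/ssh_host_rsa_key.pub on the server itself).
key=$(mktemp -u /tmp/demo_host_key.XXXXXX)
ssh-keygen -q -t rsa -b 2048 -N '' -f "$key"

# Print the fingerprint the same way it is done on bastion-restricted-01
# above; this is what gets compared against
# https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints
fp=$(ssh-keygen -l -f "$key.pub")
echo "$fp"
```

Part of the confusion in the log is display format, not key material: older OpenSSH clients print an MD5 hex fingerprint (`d3:33:67:...`), while newer ones default to base64 SHA256 (`SHA256:...`); OpenSSH 6.8+ can force the old style with `ssh-keygen -l -E md5 -f <key>.pub` for comparison against older published fingerprints.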
[23:43:21] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints
[23:43:48] andrewbogott: when i ssh to bastion-restricted-eqiad i am not getting a response from that server, but something different
[23:43:50] ah! Thanks. They are (wrongly) stored in two different places, the other place is empty/out of date
[23:44:19] but the fingerprint matches the wiki page
[23:44:19] wtf
[23:45:03] Helder: yes :p https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bastion-restricted.wmflabs.org
[23:45:03] mutante: so, the key I get when I run your command and the key on the wiki page and the key I see when I ssh to bastion-restricted-eqiad are all the same
[23:45:40] andrewbogott: eh, yea, part of the confusion is RSA vs. ECDSA
[23:45:50] ok
[23:46:01] so, things look back to normal for you as well?
[23:46:55] mutante: ^ ?
[23:47:30] andrewbogott: well, yes, as in "i can ssh to an instance via the bastion" again
[23:47:44] mutante: and the keys you see seem reasonable?
[23:47:46] hostkeys
[23:48:18] yes, except this:
[23:48:20] Offending key for IP
[23:48:26] Matching host key
[23:48:43] did you accept the new, broken key while things were breaking?
[23:48:43] i dunno why, but the .155 IP is also normal
[23:48:59] no
[23:49:24] but now it looks like it was the right one
[23:49:29] since we found the wiki page
[23:49:49] …ok...
[23:50:54] I’m going to declare victory for now (status quo regained!) and will write things up later. and then after that I’ll try to repeat the thing that caused the problem in the first place :/