[00:09:39] Tool-Labs: zoomviewer seems to be down - https://phabricator.wikimedia.org/T97790#1255891 (dschwen) This seems to be a Chrome specific issue. Flash version works in FF. I'll try to detect the failure and redirect to the JS version.
[00:28:16] Tool-Labs: zoomviewer seems to be down - https://phabricator.wikimedia.org/T97790#1255897 (dschwen) Ok, this is implemented now. If the flash object cannot be successfully built the user is automatically directed to the JS version.
[02:06:03] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[02:14:41] Hi all!
[02:15:31] I'm having trouble establishing websocket connections to one of my instances. Was anything changed in the instance proxy setup by any chance?
[02:27:25] Hmm, might be a Chrome issue
[02:31:01] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:54:27] PROBLEM - Puppet staleness on tools-mailrelay-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[12:50:48] r1t3mm9
[12:50:56] Hah.
[12:51:06] Another low-sec password bites the dust. :-)
[12:52:04] :D
[12:52:30] * petan tries to ssh root@en.wikipedia.org with this password.... SUCCESS
[12:53:31] rm -rf /var/www/wikipedia
[12:53:34] er. wrong window
[13:21:48] tools-login appears slow?
[13:32:57] liangent: I see a bit higher load than usual on the filesystem, but nothing out of the ordinary.
[13:33:50] Coren: alerts have fired off for labstore1001 re: iowait
[13:37:10] Yeah, I'm looking at the graphs now.
[13:39:15] paravoid: I'm seeing it just tickle the 25% threshold we set matching pretty exactly a period of heavy writes, lasting an hour or so.
[13:39:52] dunno, I haven't investigated it, just mentioning that I see a red light blinking :P
[13:41:19] paravoid: Yeah, with the current thresholds labs load can actually push the iowait into the red even if nothing's technically amiss. I'm not entirely sure if that qualifies as signal or noise. :-( That said, we're switching to Jessie this PM which should reduce the checksum bottleneck eightfold.
[13:41:41] With luck, everything will be securely plugged in this time.
[13:42:13] if the thresholds are too low, fix that :)
[13:42:23] also see https://gerrit.wikimedia.org/r/#/c/208632/ btw
[13:45:06] paravoid: That's what I said - I'm not immediately clear whether that's a case of "threshold too low". /Right now/ it can trigger on normal load, but that's an artifact I'm pretty sure. I'll be able to get a better sense of what the "right" threshold is once nfs has been running on Jessie a day or two.
[14:06:33] Labs, OpenStreetMap, Patch-For-Review: Block OruxMaps app from hitting labs proxy - https://phabricator.wikimedia.org/T97841#1256845 (MaxSem) Can this be considered done?
[14:27:03] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[14:49:10] Labs, Labs-Infrastructure, operations, discovery-system: Allow creation of SRV records in labs. - https://phabricator.wikimedia.org/T98009#1257018 (Joe) NEW
[14:49:20] Krenair: Thanks for solving that search namespace ticket :-)
[14:52:04] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:54:52] Labs: Abolish use of ec2id for new instances - https://phabricator.wikimedia.org/T95480#1257052 (Andrew)
[15:07:35] Sysadmin needs food, badly.
[15:26:20] Coren: I just updated puppetsigner to clear out obsolete certs and I watched the ec2id of every instance that was ever deleted scroll by… 1200 of them or so.
[15:26:51] I note the reassuring lack of a whole pile of puppet failures, so that should be good. :-)
[15:41:21] multichill, which search namespace ticket?
[15:42:09] Krenair: https://phabricator.wikimedia.org/T67132
[15:42:32] Noticed it because it changed workboards
[15:42:39] oh that
[15:42:51] yw
[15:53:46] Coren: have a moment to talk about split horizon?
[15:53:56] andrewbogott: Sure.
[15:54:16] You were suggesting that we run two different servers for the two different scopes.
[15:54:20] Today, we /are/ doing that already...
[15:54:27] since I haven’t adopted designate for public dns yet...
[15:54:42] Ah, indeed.
[15:54:57] So, pdns/mysql recurses to pdns/ldap which knows all the public stuff.
[15:55:09] Though that's not so much a question of design as it is one of practicality. :-)
[15:55:33] I could, right now, hand-enter the entries that we need in pdns/mysql and everything would work fine I think.
[15:55:42] But I’m not sure what that tells us about the long run.
[15:56:13] We could plan on always having two different servers running two different instances of designate and pdns — one for private and one for public that backs up the private one.
[15:56:29] That might allow us to use designate and horizon as designed rather than hacking around them
[15:56:43] Unless you know that designate already supports a use case of ‘add this entry to this server and that entry to that one...'
[15:56:45] That's definitely a plus.
[15:57:17] If there are two designate services then Horizon will automatically detect both and think I can just silence the private one. So that the GUI only points to the public version.
[15:57:45] Hm. How disruptive would testing that be?
[15:58:00] Because that sounds like a good idea.
[15:58:12] Well, all of the Horizon stuff is far off anyway since designate/horizon isn’t supported until K and we’re running I now.
[15:58:36] So, the short run would be: I just add some entries to designate.
[15:59:04] The long run (in K) would be: We replace the existing pdns/ldap server with a brand new designate/pdns/mysql implementation and point Horizon at it.
[15:59:30] We still wouldn’t have any kind of automated system for split horizon, though, I’d still be adding entries by hand to the private server as needed (which is roughly how things are handled now, except in puppet.)
[15:59:57] um… sorry if I’m starting in the middle of this story :)
[16:00:51] * Coren ponders.
[16:01:06] So right now, IPs end up in mysql how exactly? I think I'm missing a step.
[16:01:30] right now mysql only knows about private dns.
[16:01:46] And they’re entered automatically via ‘sink’ which handles instance creation/deletion notifications.
[16:02:05] I can also twiddle the values with curl &c.
[16:02:53] All right. And when we add a name in wikitech to associate with a public IP, that simply gets stuffed into... LDAP right?
[16:03:46] Yeah, that’s still handled with custom OpenStackManager code.
[16:03:56] In theory there is ready-made code to do the same in Kilo/Horizon/Designate.
[16:05:59] Now the only bit of trickery we'd need is for that code (openstackmanager or otherwise) to /also/ add a matching entry in the 'local' mysql table.
[16:07:33] Right. I think I’d just let it slide for now since we’ve been handling it via puppet patches heretofore.
[16:07:46] But in the Horizon future I’d want to automate it somehow, which might be easy or hard...
[16:08:10] But not harder with the one-designate-two-pdns approach.
[16:08:18] *than with
[16:08:48] So, my immediate question is: What kind of records am I sticking in private pdns in order to replicate the behavior of an ‘alias’ in dnsmasq?
[16:12:37] A simple A record does the trick.
[16:12:49] But you need to actually serve the zone, which is trickier.
[16:13:55] So we just need to remember to do it. Thankfully, the outside zones are meaningless on the private side except for what we add to it since the addresses are necessarily unreachable anyways.
[16:14:30] And it might actually make *more* sense from a user point of view to get an NXDOMAIN for a name that can't possibly be reached than time out with packets going to a black hole.
[16:26:15] Coren: ok, I will try...
[16:26:21] after I catch up with all this gpg stuff
[16:41:06] I cannot find "Add Instance" option in https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance as per https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Notes/Labs-vagrant
[16:41:28] Where can I create new instance of lab-vagrant?
[16:42:33] phoenix303: are you an admin for that project?
[16:42:37] I am a participant in Outreachy program and would like to setup test wiki for project #translation-search?
[16:43:15] andrewbogott, I am not sure
[16:43:26] How can I check that?
[16:43:41] phoenix303: what is your username on wikitech?
[16:43:47] Phoenix303
[16:44:29] phoenix303: have you used labs before, at all?
[16:44:45] (Sorry, I thought that lab-vagrant was an existing labs project but I see now that it is not)
[16:44:46] no not at all
[16:45:12] phoenix303: ok. I’m stretched a bit thin, who is your internship mentor?
[16:45:25] Nikerabbit and Nemo_bis
[16:45:38] * Coren looks for a project request phab ticket.
[16:46:05] phoenix303: probably one of them is expecting you to do your work within an existing project, we would need to know what that project is and give you admin rights there to create an instance.
[16:46:34] mediawiki-extensions-translate
[16:47:35] andrewbogott: As a point of comparison, mediawiki-extensions-newsletter had a project created for it.
[16:47:49] https://phabricator.wikimedia.org/T97523
[16:48:05] surely there’s already an existing project for translate...
[16:48:16] or is there a project request and I’m missing it?
[16:48:33] No, that's what I just checked. That one doesn't.
[16:48:54] phoenix303: We'll need to have your mentor(s) chime in first.
[16:48:59] I will create one and who should I assign to?
[16:48:59] Nemo_bis: ping?
[16:49:21] The language engineering project is too close to the quota already
[16:49:23] phoenix303: It's not clear that you'll need a separate project, let's ask your mentors first.
[16:49:39] Nemo_bis: We can up the quota at need - the only question is organizational.
[16:49:42] Okay Coren
[16:50:25] Nemo_bis: That is, does it make more sense for the instance to be part of that project or to be on its own.
[16:50:36] On its own, definitely
[16:50:41] Can https://wikitech.wikimedia.org/wiki/Nova_Resource:Pagemigration be renamed?
[16:50:56] no renaming, sorry
[16:51:16] Well doesn't matter that much
[16:51:27] I wonder if we should have a blanket GSOC/Outreachy project that can be reused
[16:51:38] Nemo_bis: No big deal either way. I'll just create a project.
[16:51:56] bd808: Not really worthwhile IMO - projects are, in themselves, lightweight.
[16:52:16] bd808: Add new instance to project ~= add new instance in new project.
[16:52:16] cool. I remember being told otherwise a long time ago
[16:52:34] bd808: That might have been because gluster volumes. Those _were_ hella expensive.
[16:52:51] Coren: actually, I'll just add her to our "pagemigration" project
[16:52:53] seems possible. It was in the way long ago for sure
[16:52:56] I don't want to admin a 7th project
[16:53:10] phoenix303: what's your username?
[16:53:13] Nemo_bis: Hence, "organizational" :-)
[16:53:21] Phoenix303
[16:53:24] "Successfully added Phoenix303 to pagemigration."
[16:54:08] phoenix303: you should now be able to create an instance
[16:54:26] At https://wikitech.wikimedia.org/wiki/Special:NovaInstance
[16:55:40] Feel also free to do whatever you wish with special-pm-instance, but not with "nemobis" which is currently a hell of extreme phpunit runs
[16:56:22] Nemo_bis, I cannot see "Add Instance" option
[16:57:08] phoenix303: you might need to logout and login again
[16:58:49] was your issue sorted out?
[16:59:43] phoenix303: also you need to select the project name in the filter up top.
[17:02:20] Nemo_bis, I can see two instances "nemobis" and "special-pm-instance" as you have mentioned. Do I have to add another instance or use existing one "special-pm-instance"?
[17:02:34] phoenix303: add a new one
[17:02:47] we'll think about deleting the other later
[17:03:33] ok. Thank you Nemo_bis if I ran into a problem will ping you :)
[17:04:20] Good
[17:04:29] Thanks also Coren and bd808 for quick assistance :)
[17:05:14] Yeah sorry Thanks Coren, andrewbogott and bd808 for your instant help :)
[17:08:05] Nemo_bis, instance type m1.small, m1.medium, m1.large or m1.xlarge?
[17:09:00] phoenix303: small should suffice
[17:11:12] Thank you Nemo_bis. Created an instance.
[17:14:55] * yuvipanda waves at andrewbogott and Coren
[17:15:01] * yuvipanda is on way to office
[17:15:20] Good BART
[17:15:42] I'm walking to Bart now
[17:15:58] Coren: Alex left some comments on the labs storage patch
[17:16:12] * Coren goes to read.
[17:19:22] Coren: Hey! I set up a web proxy for an instance one day ago. But it won't load on my browser, I get '504 Gateway Time-out'. Any ideas to fix it?
[17:19:33] newsletter-test.wmflabs.org - this is the one
[17:20:03] tinajohnson: Make sure that the security group applying to the target instance actually allows HTTP (port 80); otherwise the proxy will timeout trying to reach it.
[17:20:18] (Because it'll be filtered)
[17:22:47] Um, anything I can do to un-filter it?
[17:25:01] tinajohnson: Add port 80 to your default security group. Or better yet, create a new security group for web hosts, and apply it to any new instances. But unfortunately you can't update what security groups a running instance is in.
[17:25:24] Coren: Btw, is there a way I can add a public ip to one of my instances in labs?
[17:26:07] csteipp: We have to up your quota; but we tend to be miserly with IPs. Why do you need it public as opposed to just proxied?
[17:27:17] Coren: I'm testing out a honeypot, and I'd like for live, non-web traffic to hit it.
[17:27:32] csteipp: Thanks !
[17:27:48] That's a reason. Please to open a phab ticket to get the quota? I'll be able to do it after our ops meeting.
[17:27:59] Coren: will do, thanks!
[17:39:08] Quarry, Analytics-Kanban: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#1257804 (kevinator) a:Milimetric
[17:39:28] csteipp: I don't know why but the rule fails to get added :\
[17:39:40] Labs: Add public ip to security-tools - https://phabricator.wikimedia.org/T98038#1257807 (csteipp) NEW
[17:41:09] tinajohnson: I've had it randomly fail too. Definitely cut-and-paste from https://wikitech.wikimedia.org/wiki/Help:Security_groups
[17:41:51] csteipp: worked just now!
[18:08:17] I have created an instance i-00000c0f.eqiad.wmflabs and when I try to do "ssh i-00000c0f.eqiad.wmflabs" on my machine it shows "Name or service not known"
[18:08:56] also get console output action shows "no authorized ssh keys fingerprints found for user ubuntu."
[18:09:35] phoenix303: Ignore the aws ID; use the actual name you gave it instead. :-)
[18:12:33] Coren, I tried "ssh translation-search" (translation-search: instance name) still it throws same error "ssh: Could not resolve hostname translation-search: Name or service not known"
[18:13:06] phoenix303: We're in the middle of a meeting right now, I'll take a look at it shortly afterwards.
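(Editor's sketch, not part of the channel log.) The 504 Gateway Time-out discussed above comes from the proxy being unable to reach port 80 on the target instance. A quick way to see whether a TCP port answers at all is a plain connect attempt. The helper name `port_reachable` is hypothetical, and the host in the usage note is only the instance naming pattern seen in this log. The result reflects reachability from the machine that runs the check, so a port that answers from inside the project could still be filtered for the proxy.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `port_reachable("some-instance.eqiad.wmflabs", 80)` run from the bastion would show whether anything is listening and unfiltered on the HTTP port from that vantage point.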
[18:13:19] sure Coren Thank you :)
[18:13:47] phoenix303: I’m in the same meeting, but, are you already doing this? https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_ProxyCommand_ssh_option_.28recommended.29
[18:18:17] phoenix303: Have you ssh-ed to your project already ? like shellname@bastion.wmflabs.org
[18:18:20] andrewbogott, no. I am using steps in https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Notes/Labs-vagrant. I have also generated new ssh keys and pasted it "open stack preferences".
[18:18:44] phoenix303: Try ssh translation-search.eqiad.wmflabs
[18:18:47] no
[18:19:03] phoenix303: that page is missing a whole section about getting through the bastion. bd808, take note
[18:19:13] phoenix303: meanwhile, best to follow the page I linked above.
[18:20:27] tinajohnson, Niharika same error as andrewbogott said I have missed number of steps.
[18:20:48] phoenix303: labs instances are not by default publicly available at all, you have to tunnel
[18:21:36] thank you andrewbogott did not know that
[18:30:39] Labs, Labs-Infrastructure: Limit NFS bandwith per-instance - https://phabricator.wikimedia.org/T98048#1258110 (coren) NEW a:coren
[18:31:33] Labs, Labs-Infrastructure, ToolLabs-Goals-Q4: Limit NFS bandwith per-instance - https://phabricator.wikimedia.org/T98048#1258122 (yuvipanda)
[18:31:56] Tool-Labs, Patch-For-Review, ToolLabs-Goals-Q4: Harmonize VMEM available on all exec hosts - https://phabricator.wikimedia.org/T95979#1258124 (yuvipanda) Open>Resolved a:yuvipanda
[18:39:32] Labs, Tool-Labs, ToolLabs-Goals-Q4: Switchover Labs NFS server to labstore1002 - https://phabricator.wikimedia.org/T97219#1258147 (coren) Rescheduled for today, same time (19h UTC)
[18:49:44] Created and added content as mentioned in https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_ProxyCommand_ssh_option_.28recommended.29 replacing with Phoenix303.
[18:50:04] Then ssh translation-search.eqiad.wmflabs
[18:50:20] Error: The authenticity of host 'bastion.wmflabs.org (208.80.155.129)' can't be established.
[18:50:30] Permission denied (publickey).
[18:52:27] I have two public keys first for window and second (created today) for ubuntu. This should not be a problem?
[18:54:52] Currently I am on ubuntu
[18:59:25] phoenix303: ok, meeting over. Let me catch up...
[18:59:53] phoenix303: when it says the authenticity can’t be established it should prompt you to accept the fingerprint. Doesn’t it?
[19:00:26] phoenix303: official fingerprint list is here: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints
[19:04:10] ahem, FormatJson:parse stopped working, graph ext started throwing errors
[19:04:42] highly weird - FormatJson has existed for a very long time
[19:04:43] andrewbogott, it does but then it says Permission denied (publickey).
[19:04:59] phoenix303: ok, you’re saying ‘yes’
[19:05:07] and then… it doesn’t prompt you like that on subsequent attempts does it?
[19:05:40] yurik: https://phabricator.wikimedia.org/T98050#1258204
[19:06:54] legoktm, will it fix itself in betalabs too?
[19:07:11] yurik: eventually it should yeah
[19:07:31] andrewbogott, no
[19:07:54] Permission denied (publickey).
[19:07:54] ssh_exchange_identification: Connection closed by remote host
[19:08:03] phoenix303: ok. Try connecting to bastion.wmflabs.org?
[19:08:09] I’m going to look there and see what it says
[19:08:54] On "ssh bastion.wmflabs.org"
[19:09:00] Permission denied (publickey).
[19:09:10] phoenix303: Make sure you are using your _shell_ account name (which probably doesn't have a capital)
[19:09:33] yeah, that’s part of the problem at least.
[19:09:49] there you go
[19:09:58] same should work for your other instance as well.
[19:11:45] legoktm, is it possible to force sync labs?
[19:11:46] Coren, I added "phoenix303". Now it says "Permission denied (publickey)." how would I know that the connection has been established?
[19:12:03] that 503 is really annoying )
[19:12:09] By not telling you that, for the most part. :-)
[19:13:12] but it does :(
[19:13:41] phoenix303: Wait, I saw you logged in on bastion.
[19:16:09] as did I
[19:16:59] andrewbogott, so is it working?
[19:17:12] phoenix303: that seems like something that you should know the answer to :)
[19:17:43] phoenix303: You were able to connect to the bastion. I'm not sure why you think you weren't.
[19:20:36] Coren, the error "Permission denied (publickey)" made me think and there is no confirmation message. But since I know now it worked so I can move on to the next step.
[19:21:13] You should know it worked for the simple reason that after an ssh to the bastion you would get a shell prompt on that server. :-)
[19:22:05] yes Coren I got that will keep that in mind :)
[19:22:24] andrewbogott and Coren : Thank you for your help :)
[19:26:51] yes Coren I got that prompt(which I did not know was sort of confirmation) and this I will keep in mind :)
[19:31:33] phoenix303: I remember getting this error when connecting from bastion to VM. I generated a fresh ssh key from bastion, pasted that in Wikitech prefs, and got it right.
[19:33:05] legoktm, i think it is still broken in betalabs - can we force it somehow?
[19:33:57] yurik: it looks like the jenkins job is stuck...you can probably force scap manually and see if that works?
[19:33:58] idk
[19:34:29] legoktm, i don't remember how to force scap on depl cluster :)
[19:34:46] yurik: just login and type in "scap"
[19:37:26] Labs, Labs-Infrastructure, ToolLabs-Goals-Q4: Limit NFS bandwith per-instance - https://phabricator.wikimedia.org/T98048#1258393 (coren) This is simplest at the instance level proper, by putting an THC tc on the nic side; at the cost of having to deploy tc rules on every instance (and easily circumvent...
[19:38:34] scaping....
[19:38:44] from deployment-bastion
[19:38:57] hope no one kills me ...
[19:40:56] Traceback (most recent call last):
[19:40:56]   File "/mnt/srv/deployment/scap/scap/scap/cli.py", line 276, in run
[19:40:57]     exit_status = app.main(extra_args)
[19:40:57]   File "/mnt/srv/deployment/scap/scap/scap/main.py", line 39, in main
[19:40:57]     self._before_cluster_sync()
[19:40:57]   File "/mnt/srv/deployment/scap/scap/scap/main.py", line 223, in _before_cluster_sync
[19:40:59]     tasks.cache_git_info(version, self.config)
[19:41:01]   File "/mnt/srv/deployment/scap/scap/scap/tasks.py", line 74, in cache_git_info
[19:41:03]     with open(cache_file, 'w') as f:
[19:41:05] IOError: [Errno 13] Permission denied: u'/srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-ShoutWikiAPI.json'
[19:43:35] tonythomas, I have created one today and have pasted it and I am able to connect to bastion.wmflabs.org
[19:44:43] That problem is solved though the next step is not working :)
[19:58:49] What's a good way to tell when a new instance ought to be ready?
[19:58:58] * halfak is looking at labels.eqiad.wmflabs
[19:59:27] halfak: ssh in and see if that works :)
[19:59:40] yuvipanda, ha. SSHing has been failing for ~10 minutes.
[19:59:44] Not sure if I should wait longer.
[19:59:49] key denied
[20:00:02] halfak: precise or trusty?
[20:00:08] trusty
[20:00:30] Instance state: "Active"
[20:00:44] I can ssh into other trusty machines in the project
[20:00:50] andrewbogott: ^
[20:00:55] I can’t login with my root key either
[20:00:56] weird
[20:00:58] Maybe I should try a reboot
[20:01:00] k
[20:01:09] Will wait for andrewbogott
[20:01:42] halfak: what project is that?
[20:02:24] revscoring
[20:02:31] andrewbogott, ^
[20:07:18] halfak: something interesting is happening, this will take a few minutes.
[20:07:26] OK
[20:07:31] * halfak holds tight
[20:07:32] :)
[20:18:03] yuvipanda, do you know how to force sync betalabs? :(
[20:18:09] nope.
[20:18:10] it is borked
[20:18:14] yurik: ask in #wikimedia-releng
[20:26:37] yuvipanda: As planned, I'm not going to do the start-nfs script but do each step from it by hand instead and double check at each point. Cool with you?
[20:26:52] Coren: yup. just !log each step on -operations/
[20:26:57] Coren: more verbose the better :)
[20:55:29] * halfak twiddles thumbs
[20:55:31] halfak: should be fixed now I think
[20:55:32] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:33] andrewbogott, "Connection to 208.80.155.129 timed out while waiting to read"
[20:55:38] halfak: well… now there’s an NFS outage, so, I suspect that’s a factor
[20:55:39] A scheduled outage, I should add
[20:55:39] how long will this outage last?
[20:55:40] andrewbogott, gotcha. No worries. Thanks for taking a look so quickly :). When should I check again?
[20:55:43] apper, halfak, it’s predicted to take 30 minutes but is scheduled for 3 hour window.
[20:55:45] kk. Thanks andrewbogott
[20:55:45] andrewbogott: thanks. Is this the normal time of the day for scheduled outages? I understand the need for such outages, but it's a bit frustrating if you have one hour in the evening to work on a request and you're not able to do so. I know that something like the "best time" does not exist for a server, which is used worldwide. But it seems to me, that it's very often around 8 pm to 12 pm UTC.
[20:55:46] apper: we did schedule it a week in advance...
[20:55:46] apper: and for better or worse all the labs people are in US west coast now.
[20:55:47] Is labs down?
[20:55:48] Melos: “NFS switch in progress, expect minor NFS stalls over the next hour”
[20:55:49] yuvipanda: How long is it supposed to be broken? Wdq just seemed to have stopped working
[20:55:49] multichill: the switchover failed again, so it is being rolled back. I hope 5 minutes
[20:55:49] *sigh*
[20:55:49] Hmm, ok, so you get to fill out an incident report again/
[20:55:50] ?
[20:55:50] multichill: gifti well, we’re well within our window, so I guess not?
[20:55:50] We always do a post mortem if we don't achieve the expected result, very much the same as after a (big) incident
[20:55:50] Stuff that wasn't supposed to happen, happened.....
[20:55:51] multichill: oh yeah totally
[20:55:52] multichill: but I think that’s just going to be on the ticket
[20:56:57] yuvipanda: can you throw it in the email too?
[20:59:56] Betacommand: yeah.
[21:00:40] * halfak logs into labels.eqiad.wmflabs
[21:00:45] :)
[21:01:48] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[21:02:36] Stupid version mismatch. :-(
[21:04:04] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 764627 bytes in 2.474 second response time
[21:04:38] halfak: were you able to log in?
[21:05:05] yuvipanda, yessir :)
[21:05:12] halfak: sweet
[21:07:52] "502 Bad Gateway" error while trying to access http://translation-search.wmflabs.org
[21:08:16] phoenix303: did you set the security group etc.?
[21:09:36] PROBLEM - SSH on tools-submit is CRITICAL: Server answer:
[21:09:52] Nemo_bis, default security group is already set. Do I have to create a new one?
[21:11:08] phoenix303: instances are not accessible from outside by default
[21:11:16] (Unless that changed in the meanwhile)
[21:12:43] no i guess they are not accessible from outside
[21:13:37] Labs, Tool-Labs, ToolLabs-Goals-Q4: Switchover Labs NFS server to labstore1002 - https://phabricator.wikimedia.org/T97219#1258827 (coren) This was attempted and failed, as thin_checked failed on labstore1002 because of a metadata format issue (I expect caused by the version difference). This will need...
[21:20:53] uh
[21:20:54] tools.extreg-wos@tools-bastion-01:~$ crontab -l
[21:20:54] ssh_exchange_identification: Connection closed by remote host
[21:20:57] what does that mean?
[21:20:59] Nemo_bis, when creating "translation-search" instance I have already used "default" security groups. Can I change that?
[21:21:42] RECOVERY - Puppet failure on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:22:05] legoktm: that means tools-submit is borking, probably
[21:22:16] uh, I hope the instance isn’t dead
[21:22:19] * yuvipanda curses tools-submit
[21:22:24] needs restarting
[21:22:34] :\
[21:22:36] aaargh
[21:22:39] fucking wikitech
[21:22:42] * yuvipanda logs out and back in
[21:22:49] andrewbogott: the sessions seem to last less than 2h now...
[21:22:50] nfs mount dead/
[21:23:14] yuvipanda: I graceful’d apache an hour or two ago. That shouldn’t have reset memcache though…
[21:23:24] wikitech loses my session data every 30 seconds or so, so I have to confirm every action two or three times :/
[21:23:45] is it just running low on memcache space?
[21:23:52] it’s also really really slow
[21:24:08] !log tools reboot tools-submit, was stuck
[21:24:11] Logged the message, Master
[21:25:22] valhallasw`nuage: we should backup the crontabs :|
[21:25:25] if that instance goes we’re hosed
[21:25:30] didn’t you file a bug for that?
[21:25:35] wikitech is… working just fine for me, still has my session data, quick responses.
[21:25:53] It's also working fine for me. Hm.
[21:25:57] yuvipanda: how long will that take?
[21:26:06] or do I need to logout/login or something?
[21:26:15] legoktm: for crontab -l? shouldn’t.
[21:26:25] ssh: connect to host tools-submit port 22: Connection refused
[21:26:36] ok now I'm at
[21:26:37] no crontab for tools.extreg-wos
[21:26:43] and it most definitely has a crontab!
[21:27:13] btw, Coren, Yuvi, I’m doing a scripted migrate from virt1012 to labvirt1007. No tools instances will be affected.
[21:27:54] legoktm: I… don’t see a crontab for it
[21:27:58] wtf
[21:28:20] root@tools-submit:/var/spool/cron/crontabs# ls | grep extra
[21:28:24] root@tools-submit:/var/spool/cron/crontabs#
[21:28:47] legoktm: it isn’t in the backup I took a week ago either.
[21:28:51] according to my .bash_history I created it at some point
[21:29:15] maybe it's one of the local cronjobs that have to be cleaned up
[21:29:16] it was one line so I can restore it easily
[21:29:20] yeah
[21:29:24] maybe it’s on tools-trusty?
[21:29:28] * yuvipanda tries
[21:29:35] RECOVERY - SSH on tools-submit is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[21:29:44] yup
[21:29:48] legoktm: it’s on tools-trusty
[21:29:57] tools.extreg-wos tools.revibot tools.revibot-ii
[21:29:57] tools.reportsbot tools.revibot-i tools.revibot-iii
[21:30:13] legoktm: 0 * * * * jsub -l release=trusty -N generate -once -quiet -mem 900M /data/project/extreg-wos/venv/bin/python /data/project/extreg-wos/extreg-wos/generate.p
[21:30:18] y
[21:30:19] anyway
[21:30:23] I’m going to restore those
[21:30:25] thanks
[21:30:29] yuvipanda: wait
[21:30:35] valhallasw`nuage: ok
[21:30:37] yuvipanda: you can't just dump those to tools-submit, I think
[21:30:43] unless they are purely jsub ones
[21:30:50] valhallasw`nuage: so first step is to retire tools-trusty.
[21:30:56] btw
[21:30:57] tools.extreg-wos@tools-trusty:~$ crontab -l
[21:30:57] no crontab for tools.extreg-wos
[21:31:02] so there's no way for me to access it?
[21:31:04] valhallasw`nuage: which I’m going to do by changing public IP of tools-trusty to point to tools-bastion-01
[21:31:05] legoktm: use the other crontab
[21:31:09] legoktm: /bin/crontab or something
[21:31:13] yuvipanda: err?
[21:31:21] valhallasw`nuage: lots of people are still logged into it
[21:31:34] so next time they login, they’ll go to tools-bastion-01
[21:31:40] /usr/bin/crontab -l worked
[21:31:41] thanks
[21:31:41] and you're going to hard-break their connections by changing the ip?
[21:32:13] bah, I can’t do that can I...
[21:32:19] * yuvipanda ponders now
[21:32:28] valhallasw`nuage: err, I am just going to change the hostname
[21:32:31] DNS
[21:32:31] yuvipanda: just disallow logins for non-roots?
[21:32:35] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:32:41] hmm
[21:32:44] that’s a cleaner solution
[21:32:56] with a note 'cannot login, please use this other host'
[21:33:47] ph... uh, too late
[21:34:27] Tool-Labs, ToolLabs-Goals-Q4: Crontabs are not backed up - https://phabricator.wikimedia.org/T95798#1258873 (yuvipanda) Going to do (3), as a purely sysadmin accessible backup. T90561 should take care of it long term.
[21:37:34] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0]
[21:40:51] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:41:12] woohoo
[21:41:14] tools.extreg-wos@tools-bastion-01:~/src/extensions$ git pull
[21:41:14] error: inflate: data stream error (incorrect header check)
[21:41:14] fatal: loose object 8ba59af0a146ba87fb9ac1194168a050b324c8cc (stored in .git/objects/8b/a59af0a146ba87fb9ac1194168a050b324c8cc) is corrupt
[21:41:14] fatal: The remote end hung up unexpectedly
[21:41:28] is there a way to fix that without re-cloning?
[21:41:33] ugh
[21:41:37] this happened to valhallasw`nuage earlier...
[21:41:40] Coren: file corruption
[21:41:57] legoktm: cd .git, rm -rf objects, git fetch --all
[21:42:06] legoktm: can you wait for a minute?
[21:42:09] yes
[21:42:16] no rush, it's apparently been broken for a while
[21:42:21] legoktm: heh, cool :)
[21:42:26] https://phabricator.wikimedia.org/T96488
[21:42:49] legoktm: is that file filled with zeros?
[21:42:54] legoktm: as in \x00s
[21:43:51] valhallasw`nuage: umm, how do I tell?
[21:43:58] you can also check yourself :) [21:44:07] oh, right [21:44:12] but vim .git/objects/8b/a59af0a146ba87fb9ac1194168a050b324c8cc, basically [21:45:54] then ^@ is NUL [21:46:06] but strangely enough there's a hash at the top of the file that's not NUL [21:46:39] valhallasw`nuage: I wonder if it’s NFS interaction with git or just git or just NFS. [21:46:42] * yuvipanda hopes it’s the former [21:47:12] yuvipanda: yeah, or git is the only product that notices and complains :( [21:47:20] that too [21:50:37] Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1258912 (yuvipanda) p:Triage>High [21:50:51] Labs, Tool-Labs, operations: NFS file corruption - https://phabricator.wikimedia.org/T96488#1218173 (yuvipanda) [21:51:28] Labs, Tool-Labs, operations: NFS file corruption - https://phabricator.wikimedia.org/T96488#1218173 (yuvipanda) This was reported by @legoktm just now again: ```tools.extreg-wos@tools-bastion-01:~/src/extensions$ git pull error: inflate: data stream error (incorrect header check) fatal: loose object... [21:53:45] Labs, Tool-Labs, operations: NFS file corruption - https://phabricator.wikimedia.org/T96488#1258932 (yuvipanda) This could be due to interaction between git / NFS, or just git, or just NFS. I hope it's just git. [21:55:36] yuvipanda: can I copy the folder somewhere else so you can keep investigating while I fix my tool? or would that make it harder to investigate? [21:55:46] legoktm: nah, you can do that. [21:55:52] legoktm: I’d suggest recloning fresh tho [21:56:13] I love recloning 700+ git repos! [21:56:28] legoktm: why do you have 700 git repos?! [21:56:30] legoktm: killing the objects dir means you have to pull in all changes anyway...
[21:56:41] legoktm: although, maybe not [21:56:45] because submodules [21:56:47] yuvipanda: it's every single mediawiki extension [21:56:51] aaaahahaha [21:56:51] right [21:56:53] carry on [21:58:14] Labs, Tool-Labs, operations: NFS file corruption - https://phabricator.wikimedia.org/T96488#1258945 (Legoktm) I moved the corrupt folder to `/data/project/extreg-wos/src/extensions-corrupt` so I can unbreak my tool. [21:59:21] I'm just recloning fresh now [21:59:25] thanks for the help :) [22:01:02] yuvipanda: Where corruption? [22:01:09] Coren: https://phabricator.wikimedia.org/T96488 [22:01:15] has paths [22:01:37] Yeah, I saw in -security. [22:05:48] RECOVERY - Puppet failure on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0] [22:15:39] yuvipanda: Could you check if the lab nfs failover broke the wdq replication? [22:18:09] multichill: looking [22:30:24] Labs, Tool-Labs, operations: NFS file corruption - https://phabricator.wikimedia.org/T96488#1259115 (coren) It's not immediately clear what could have happened to those files - the pattern does not match any form of usual corruption I've ever seen NFS do when it breaks badly. One pattern that may be o... [22:31:29] Tool-Labs, Patch-For-Review, ToolLabs-Goals-Q4: Crontabs are not backed up - https://phabricator.wikimedia.org/T95798#1259121 (yuvipanda) Open>Resolved a:yuvipanda All good now. [22:35:03] Labs, OpenStreetMap, Patch-For-Review: Block OruxMaps app from hitting labs proxy - https://phabricator.wikimedia.org/T97841#1259138 (yuvipanda) Yes, for now. It's blocked at the proxy level based off UA. I think @Coren poked legal about this? Not sure. @akosiaris can you poke the maintainers of the m...
[22:44:52] Tool-Labs, Labs-Q4-Sprint-1, ToolLabs-Goals-Q4: Explicitly define all the services that Tool Labs provides and their interfaces - https://phabricator.wikimedia.org/T93622#1259194 (yuvipanda) Open>Resolved a:yuvipanda T97748 and T97610 are the ones left to do. This is already 'defined'. [22:45:09] yuvipanda: Seems to have picked up again (wdq) [22:45:26] multichill: bah, humbug. I didn’t actually look :| (was distracted by tools-submit being down) [22:45:54] Thought so, might have lost some edits, but it picked up my latest edits in several minutes [22:46:04] multichill: cool [22:47:08] nn [22:51:36] yuvipanda: I'm heading off to bed, but I don't get the (15 * 0.1) in your calculation at https://phabricator.wikimedia.org/T97610#1247911 [22:55:25] Labs, Tool-Labs, operations: NFS file corruption - https://phabricator.wikimedia.org/T96488#1259267 (scfc) Could these be related to one of the moving instances between virtual servers? I. e., process A gets frozen, meanwhile process B updates the repository, process A gets thawed and is confused abou... [22:56:51] Labs, Tool-Labs, operations: NFS file corruption - https://phabricator.wikimedia.org/T96488#1259268 (coren) After correlation in time, it turns out that the file timestamp exactly match the period where the NFS server had to be forcibly rebooted without clean unmounts; it is almost certain that the cor... [23:12:59] twentyafterfour: beware the new setup issues and database overhaul on the wednesday phab deployment [23:14:11] valhallasw`nuage: yuvipanda any known labs outage ? [23:20:37] /msg NickServ VERIFY REGISTER KodiakAstronaut maolsdduejpz [23:20:38] matanya: I’m still slowly migrating things, so every once in a while an instance will go down for a few minutes and then restart. Nothing in tools or beta though.
[23:21:04] andrewbogott: ssh encoding01.eqiad.wmflabs [23:21:04] channel 0: open failed: administratively prohibited: open failed [23:21:04] stdio forwarding failed [23:21:04] ssh_exchange_identification: Connection closed by remote host [23:21:48] KodiakAstronaut: oops [23:21:58] matanya: that’s not me… is it happening for other instances as well? [23:22:10] checking [23:22:25] If it is, maybe try switching your proxy command to a different bastion as well [23:22:56] no it doesn't andrewbogott [23:24:59] matanya: ok… what project is that in? [23:25:11] video [23:26:22] that instance is stopped entirely. I don’t know why, but I can start it if you like :) [23:26:49] Or, actually, why don’t you log into Horizon.wikimedia.org and start it yourself? It’ll be a useful test. [23:28:48] yuvipanda: lifehack: chrome can clear its cache for a specific domain really easily [23:29:00] doing andrewbogott [23:30:17] andrewbogott: not enough rights as it seems [23:30:40] nvm [23:40:48] matanya: looks like it worked! [23:41:12] yes, thanks much, let's see if the machine can do a 4k transcode [23:41:18] that little popup menu is alphabetized, I wish I could have ‘reboot’ be the top option. [23:49:04] Labs, Tool-Labs, operations: NFS file corruption - https://phabricator.wikimedia.org/T96488#1259449 (yuvipanda) p:High>Normal @Coren should we check for other files that might be corrupted, or just let things be?
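Editor's note: the closing question, whether to check for other corrupted files, could be approached with a scan along the lines below. This is only a rough sketch, assuming the corruption looks like what was described earlier in the log (files mostly filled with NUL bytes, possibly with a few non-NUL bytes at the top). The function names `nul_fraction` and `find_suspect_files` and the 0.5 threshold are hypothetical choices for illustration, not anything used on Tool Labs.

```python
import os


def nul_fraction(path, chunk_size=1 << 20):
    """Return the fraction of bytes in a file that are NUL (0x00)."""
    total = 0
    nuls = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            nuls += chunk.count(b"\x00")
    # An empty file has no NUL bytes, so report 0.0 rather than dividing by zero.
    return nuls / total if total else 0.0


def find_suspect_files(root, threshold=0.5):
    """Walk `root` and list files whose NUL fraction is at or above `threshold`.

    The threshold is an assumed heuristic: a file with a short non-NUL
    header followed by zeros will score close to 1.0.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if nul_fraction(path) >= threshold:
                    suspects.append(path)
            except OSError:
                # Unreadable files are skipped rather than reported.
                continue
    return sorted(suspects)
```

Calling `find_suspect_files(".git/objects")` inside a repository would list loose objects that are mostly zeros. Binary files that legitimately contain many NUL bytes (sparse images, some databases) would also be flagged, so any hit is a candidate for manual inspection rather than proof of corruption.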