[03:56:47] Something on tools-exec-08 is slowing it down massively
[03:57:16] can't login, don't have access to the bot on the instance, can't stop the bot on the instance either
[04:42:18] tools working again?
[04:42:22] * andrewbogott is late to the party
[04:51:12] <^demon|away> Party? Nobody told me about a party.
[04:54:18] 'outage party'
[04:54:37] it's just like a regular party except everyone sits in the corners looking at laptops and nobody drinks or speaks
[05:20:25] That sounds like a typical meeting in the Foundation Offices.
[10:33:53] Could a labs admin help me figure out what happened to the cvn project? For some reason it seems completely broken.
[10:33:58] cvn-apache2 hasn't come back up for several days now. I've tried rebooting it, and ganglia sometimes shows some activity, but it doesn't respond to web requests and shell seems dodgy as well
[10:34:01] http://cvn.wmflabs.org/
[10:34:05] https://wikitech.wikimedia.org/wiki/Nova_Resource:Cvn
[10:34:41] the cvn-app1 server is also not running properly, seems to have problems with the network or the project data disk. The shell bots scheduled from crontab aren't running, at least.
[10:34:56] I created two new instances yesterday, but they haven't shown up in ganglia or the nova page for Cvn
[10:35:08] they seem to be disconnected from the real labs, and only show up in global listings
[10:35:13] (cvn-apache3, cvn-app2)
[10:53:34] Coren: ^
[12:08:41] hello guys..
[12:34:05] Krinkle|detached: I fixed some gluster stuff with cvn, it may be happier now.
[12:36:54] andrewbogott, I thought we were supposed to migrate to NFS4?:)
[12:37:04] be patient please
[12:39:32] ...and wait for NFS5?:P
[12:41:47] yes, exactly :)
[12:43:39] Coren is building NFS infrastructure in eqiad today. It's quite close to finished, actually.
[12:45:37] I hope it will be properly puppetized as well, so we don't run into a situation like yesterday's :-).
[13:00:16] andrewbogott: I assume the move will require copying all volumes from pmtpa-NFS to eqiad-NFS?
[13:01:03] it will, although Coren and I will do a lot of that.
[13:01:07] Depending on the project...
[13:02:25] Last time, there were a lot of tools' log files that made the process take longer than necessary. If an ETA is on the horizon for the NFS move, we could/should ask the tool maintainers to truncate old log files. I assume few people actually need access/error logs from six months ago.
[13:03:11] * andrewbogott nods
[13:10:23] scfc_de: it also doesn't help that that's the default operating mode of jsub
[13:16:46] valhallasw: I'm thinking more about access.log and error.log. These can easily become very huge for GeoHack & Co.
[13:35:53] scfc_de: oh, right. Can't we just get log rotation for those?
[13:41:40] valhallasw: Hmmm. With the Apache setup, it might have been possible; with lighttpd it looks more complicated: In the script started by "webservice start", start a loop "logrotate access.log/error.log; sleep 24h"?
[13:50:27] scfc_de: Yesterday's issues had nothing to do with puppet, there was an undocumented step in recovery. It's documented now. :-)
[13:50:52] Coren: I use puppetization and documentation as synonyms :-).
[13:51:19] scfc_de: puppet does a good job of documenting state; not so much process. :-)
[13:52:44] andrewbogott: There was something I wanted to discuss with you though. Right now, labs uses autofs for the mount, and that leads to some rather painful issues (autofs sucks very much when you try to change its config). How do you feel about puppetizing entries in fstab instead?
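(For reference, the "logrotate ...; sleep 24h" loop scfc_de floats at 13:41:40 would look roughly like the sketch below. It is only a sketch: the config path, state file, and PID file are assumptions made for illustration rather than existing Tool Labs conventions, and it folds in the caveat raised just below at 14:28:46 that lighttpd has to be told to reopen its log files after rotation.)

```bash
#!/bin/bash
# Minimal sketch of the rotation loop suggested at 13:41:40, meant to be started
# alongside the tool's lighttpd by the "webservice start" wrapper.
# Assumed paths (illustrative only): ~/logrotate.conf, ~/.logrotate.state,
# ~/lighttpd.pid.

TOOL_HOME="$HOME"

while true; do
    # Rotate access.log/error.log according to a per-tool logrotate config;
    # --state keeps logrotate's bookkeeping inside the tool's home directory.
    logrotate --state "$TOOL_HOME/.logrotate.state" "$TOOL_HOME/logrotate.conf"

    # lighttpd keeps its log file handles open, so after rotation it has to be
    # "pinged" to reopen them (the 14:28:46 caveat); SIGHUP is the conventional
    # way to ask lighttpd to reopen its logs.
    if [ -f "$TOOL_HOME/lighttpd.pid" ]; then
        kill -HUP "$(cat "$TOOL_HOME/lighttpd.pid")"
    fi

    sleep 24h
done
```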
[13:53:34] Coren: BTW, regarding that, how are .snapshots handled? On the NFS server, or on the instances?
[13:53:37] andrewbogott: We only have a few of them, and they are all pretty static by definition. I can't think of a downside; it's not like we're mounting a different home for every user.
[13:53:56] scfc_de: That's server side, with snapshots.
[13:54:12] scfc_de: But that's also not going to come back for a while.
[14:01:21] scfc_de: that could work, or just logrotate everyone's access.log/error.log
[14:28:46] valhallasw: But you have to ping lighttpd (or whatever else the tool chooses to produce the logs) to create a new file. That's especially necessary if you compress the old ones.
[14:31:36] Coren: back, catching up...
[14:35:35] Coren, one downside to using puppet is that we wouldn't have instance access until after a clean puppet run. That could make things a bit fragile.
[14:36:03] But overall it sounds OK to me. Are there any performance concerns with having more vs. fewer active mounts for a given volume?
[14:36:22] (With autofs, most instances don't actually mount the shared volume until it's accessed, right?)
[14:38:16] andrewbogott: The instance not being accessible until a clean Puppet run seems like a feature to me :-).
[14:39:00] scfc_de: I agree, presuming that puppet actually runs. In the case of labs breakage it might make debugging harder.
[14:39:24] andrewbogott: No performance hit for idle mounts except a few ms at boot.
[14:39:46] Coren: ok then. I wonder if maybe gluster felt otherwise?
[14:40:44] We could make the Puppet part depend on the labnfs role, thus excluding Gluster projects.
[14:40:45] Also, that shouldn't change login before puppet run: the /public/keys mount was also dependent on puppet having run, wasn't it?
[14:40:59] scfc_de: No gluster at all in eqiad anyways.
[14:41:31] I think /public/keys is in the instance image itself?
[14:41:50] scfc_de: Yeah, even if it isn't it could be.
[14:42:33] Coren, logins worked before puppet runs. So I think scfc_de must be right and that mount is baked into the image.
[14:42:35] In which case...
[14:42:37] * andrewbogott looks
[14:56:53] Coren: if I create a new instance
[14:56:59] Coren: would it be eqiad?
[14:57:21] that means, if I make it now, I will not have to care about the migration process at all?
[14:57:24] petan: Possibly (ask andrew). It'd also probably not quite work yet.
[14:57:36] icinga is working
[14:57:44] andrewbogott: can you explain this a bit
[14:58:15] petan: Everything that you see is still strictly pmtpa
[14:58:35] andrewbogott: eh... I see icinga, is it pmtpa?
[14:58:40] @labs-instance icinga
[14:58:41] Sometime soon I hope to switch things over so that you can create instances in eqiad, and then subsequently we'll disable new instance creation in tampa...
[14:58:48] @labs-info icinga
[14:58:48] [Name icinga doesn't exist but resolves to I-00000a48.pmtpa.wmflabs] I-00000a48.pmtpa.wmflabs is Nova Instance with name: icinga, host: virt8, IP: 10.4.1.88 of type: m1.medium, with number of CPUs: 2, RAM of this size: 4096M, member of project: nagios, size of storage: 55 and with image ID: ubuntu-12.04-precise
[14:59:01] but, I believe the sections in wikitech are clearly marked with regions.
[14:59:01] 10.4.1.88 is which DC
[14:59:10] The dc is in the hostname
[14:59:12] Coren: looks like something went down, eqiad/replicas are unreachable. In ganglia the replicas are dead
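(The fstab-instead-of-autofs idea discussed between 13:52:44 and 14:42:37 amounts to each instance carrying a static mount entry that Puppet would manage declaratively. A rough shell sketch of what such a puppetized entry would produce is below; the export path follows the labstore.svc.eqiad.wmnet:/project/... form quoted later in this log, but the project name and mount options are only illustrative assumptions.)

```bash
#!/bin/bash
# Sketch of the static NFS mount a puppetized fstab entry would produce on an
# instance, replacing the autofs map. Project name and options are assumptions.

PROJECT="cvn"   # hypothetical example project
ENTRY="labstore.svc.eqiad.wmnet:/project/${PROJECT}/home /home nfs4 rw,hard 0 0"

# Add the entry once; in practice Puppet (e.g. a mount resource) would own this.
grep -qF "$ENTRY" /etc/fstab || echo "$ENTRY" | sudo tee -a /etc/fstab >/dev/null

# With a static entry the share is mounted at boot (or explicitly, as here)
# rather than lazily on first access, as it is with autofs.
mountpoint -q /home || sudo mount /home
```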
[14:59:18] petan: Hint: the hostname says 'pmtpa' :-)
[14:59:29] petan, ip address ranges: https://wikitech.wikimedia.org/wiki/IP_addresses
[14:59:30] aha
[14:59:38] (That page is a bit unreliable, but I'm trying to keep it up to date)
[14:59:41] (at least the labs bits)
[14:59:58] hedonil: Ow. I know some network changes just occurred, lemme ask around.
[15:00:39] Coren: 'k. https://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[15:01:03] hedonil: The actual machines are up and running, it's almost certainly network.
[15:01:38] Coren: ack.
[15:01:42] hey coren and andrewbogott: I can't remember whether we can define our own instance types or whether we are stuck with the existing ones :-]
[15:01:59] Hmm... still quite a few things that seemingly don't have an up to date puppet setup looking at monitoring :(
[15:02:24] you mean regarding OS, or regarding size?
[15:02:25] I could use a 4 CPU instance, but don't really need 8G RAM nor 80G storage :-D I don't want to steal resources from others
[15:02:42] Oh, don't worry about ram and storage...
[15:02:56] we support overprovisioning, so that space isn't actually consumed unless you use it.
[15:03:03] well, storage at least :)
[15:03:17] The ram<>cpu mapping pretty much makes sense, otherwise you make the hosts wonky eventually
[15:03:44] Damianz: but how would we mine bitcoin otherwise?! :P
[15:03:59] ASICs
[15:04:14] Mining bitcoins on CPUs or GPUs is pointless... you'd spend more in electric
[15:04:22] Damianz: only if you're paying for it :P
[15:04:34] andrewbogott: thanks :-]
[15:05:02] andrewbogott: and CPU is overprovisioned as well ? :D
[15:05:33] !log integration adding in 4CPU instance integration-slave02
[15:05:36] Logged the message, Master
[15:05:53] and another crazy question: can we migrate an instance from a type to another one? :-D
[15:07:56] !log integration adding in 4CPU instance integration-slave03
[15:07:57] Logged the message, Master
[15:11:54] hashar: Hypothetically, yes, although Ryan's experiments did not work to plan and ended up trashing the instances, so we're leery of trying again. :-)
[15:12:09] Coren: thanks, I will delete the old instance then :-]
[15:12:13] luckily everything is puppetized
[15:12:14] \O/
[15:13:46] andrewbogott: Do you have a living test instance in eqiad atm?
[15:14:29] Coren: sure, you can try bastion-eqiad.wmflabs.org
[15:15:43] andrewbogott: That's going to be my victim for the next couple hours then. :-)
[15:18:22] when are wikidata instances moving?
[15:18:53] man, we shouldn't even be talking about this in this room, it's going to make everyone freak out.
[15:19:00] :-]
[15:19:01] * aude freaks out
[15:19:08] aude: We're going to be sending a timeline soon. In practice, expect a window during which projects can move on their own. :-)
[15:19:09] aude, rest assured there will be clear deliberate steps, well-broadcast on the labs-l list :)
[15:19:15] ok, great
[15:19:21] I will be more than happy to migrate the "integration" instances
[15:19:29] and beta as well if you are brave enough :D
[15:19:32] (needs NFS)
[15:19:54] andrewbogott: Oh, did we change away from our initial of haphazard, random steps? :-P
[15:20:10] plan* of ...
[15:20:13] I'll be happy when there's a duplicate tools setup in eqiad ;)
[15:20:23] It'll be haphazard and random right up to the point where volunteer projects are involved. After that, hopefully less so...
[15:23:00] Is tools feeling ok?
[15:24:06] Damianz: Network issue to the databases. Should be back shortly.
[15:24:47] Only the databases?
[15:25:31] Databases explains no reverts, but not the really really really slow edit feed (which doesn't touch the database)... maybe api/irc is a little broken too?
[15:25:49] * Damianz waits anyway as one bit working doesn't really help... but does keep stiki/huggle users from complaining
[15:26:13] !log deployment-prep rebooting bits cache
[15:26:14] Logged the message, Master
[16:04:20] hi there
[16:04:35] another stupid user problem. what would the following mean:
[16:04:39] local-render-tests@tools-login:~/$ webservice status
[16:04:40] Your webservice is scheduled:
[16:04:41] queue instance "continuous@tools-exec-08.pmtpa.wmflabs" dropped because it is temporarily not available
[16:05:22] is this intentional? can somebody restart the queue?
[16:07:34] JohannesK_WMDE: It looks like the job is already running (2606866).
[16:08:16] JohannesK_WMDE: It's not an issue. It's just informing you that it couldn't use -08 if it wanted to. Some of the possible outputs from qstat are noisier than needed.
[16:10:30] OK, that doesn't seem to be the issue. are there database problems?
[16:10:32] OperationalError: (2003, "Can't connect to MySQL server on 'enwiki.labsdb' (110)")
[16:11:05] seemingly, yes
[16:11:28] (just read the ML... probably related to yesterday's outage)
[16:12:32] Yes, network issue being fixed now.
[16:29:29] can connect to sql again, yay...
[16:41:57] Indeed, bots came back to life while I was in the shower
[16:54:14] Coren: any chance you could add User:Christopher Johnson (WMDE) to the tools project please? :)
[16:54:28] ChrisJ_WMDE: ^^
[16:54:55] addshore: Sure, gimme a sec.
[16:55:16] Cheers :)
[16:55:36] addshore: {{done}}
[16:55:45] Thanks muchly Coren !
[16:56:00] thanks !
[17:07:44] scfc_de: I'm about done for the night, do you want to investigate 61413 further? (That is your bug, right?)
[17:11:06] anyway, I've now reduced that bug to a genuine puppet issue rather than the vague labs weirdness it was before
[17:19:43] andrewbogott: It's not urgent as long as I can work around it.
[17:19:53] andrewbogott: So: Good night! :-)
[17:20:54] I updated the bug… should be enough to go on if you want to work up a puppet patch to fix it. Otherwise I can work on that tomorrow.
[17:20:57] 'night!
[18:53:49] labstore.svc.eqiad.wmnet:/project/bastion/home 30T 0 30T 0% /home
[19:00:03] Yummy.
[19:00:47] (Though /home should be little compared to /data/project?)
[19:12:44] They're the same underlying filesystem. But so you know, that 30T is what is currently allocated and I got room to grow.
[19:18:12] One filesystem for everything? Hmmm. The recent outages made me think that we should maybe move towards more partitioning, not less :-). But that's something that can always be done afterwards.
[19:27:38] Coren: replicas are accessible again, thx.
[19:28:04] Coren: but there are recurrent outages over the last ~3 days (HY000/2013): Lost connection to MySQL server at 'reading authorization packet'
[19:28:22] something we have seen before
[19:29:04] Coren: and btw - could you fix this too https://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[19:29:18] would be really great :-)
[19:29:56] hedonil: atm, my priority is to get the eqiad labs working. Something we'll all benefit from. :-)
[19:30:32] Coren: yeah 30TB sounds wonderful
[19:31:03] @replag
[19:31:17] (darnit. no bot)
[19:34:56] Coren: you are right, keep focus on eqiad, the ganglia thing is not that urgent. I'll open a bz for that
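(On the "queue instance ... dropped" message from the 16:04 exchange: as Coren notes, it is informational noise, and the job itself can be checked directly with the standard gridengine commands available from tools-login. A minimal sketch; the job id is simply the one quoted above and stands in for any job.)

```bash
#!/bin/bash
# Checking whether a grid job is really running despite qstat's noisy
# "queue instance ... dropped because it is temporarily not available" output.
# Plain (Sun/Open) Grid Engine commands; 2606866 is the job id quoted above
# and only serves as an example.

JOB_ID=2606866

# Tool Labs wrapper view: is the webservice scheduled/running at all?
webservice status

# Compact list of the current user's jobs (state "r" means running).
qstat

# Full detail for one job: the queue instance it landed on, resource requests,
# and any scheduler messages explaining why other queues were skipped.
qstat -j "$JOB_ID"
```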
[19:38:15] Coren: but I'm curious: how long does it take to sync the old and the new tools/labs over network? days? weeks? already started? something we can do?
[19:39:50] We'll have more info coming on labs-l shortly, but the short of it is "very long by default, so we'll ask people to clean up first so as to not copy gigabytes of logs and stuff". Also, both "realms" will be available to users before anything is moved, so people can copy selectively early.
[19:40:32] Are we getting eqiad tools soon? ;)
[19:41:26] * hedonil is looking forward to that :-)
[19:53:18] .
[19:53:36] * petan summons the mighty bota
[19:53:40] @hi
[19:53:42] @replag
[19:53:43] Replication lag is approximately 05:06:14.0925310
[19:53:48] wow
[19:53:53] that I call a lag
[19:54:21] @replag
[19:54:22] Replication lag is approximately 05:06:52.8136550
[19:54:56] * Damianz knocks the a off wm-bot
[19:55:17] bota <3
[19:55:37] @voice wm-bota
[19:55:42] aha
[19:55:47] this needs to be fixed :D
[19:55:48] @op
[19:55:57] bah
[19:55:59] @help
[19:55:59] I am running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 2.0.0.1 my source code is licensed under GPL and located at https://github.com/benapetr/wikimedia-bot I will be very happy if you fix my bugs or implement new features
[20:23:22] Coren: YMMV*. I think
[20:24:08] (or else, what do you mean? :))
[20:24:55] Yeah, I mean YMMV. Braino.
[20:37:12] Coren, petan: replication lag on s1 is up to 5 hours, 49 minutes; looks like a stuck query again, maybe?
[20:38:28] russblau: Probably. I'll go check.
[20:41:29] ty
[20:44:17] Ah, hm no. Looks like a lingering network issue.
[20:50:46] hi I am a new user in tools, and my replica credentials are not working. Can someone take a look?
[20:54:15] ChrisJ_WMDE: how are you connecting to the sql servers?
[20:55:12] mysql --defaults-file=~/replica.my.cnf -h wikidatawiki.labsdb
[20:59:14] ChrisJ_WMDE: What error message are you getting?
[20:59:53] valhallasw: I have tried other replicas as well, and I am getting access denied for user 'U4208'@10.4.0.220 (using password:YES)
[21:01:18] Coren: ERROR 1045 (28000): Access denied for user 'u4208'@'10.4.0.220' (using password: YES)
[21:02:01] Hm. I see what's happening. It's a side effect of the current issue with replication; new credentials are also not being propagated properly.
[21:02:55] It'll fix itself as soon as the network returns.
[21:03:17] ok. thanks for checking it out. no worries ~
[21:19:44] i can't access tools-login...
[21:20:08] "Permission denied (publickey,hostbased)."
[21:20:45] i also had a now gone weird access problem
[21:22:16] hey, anyone else having trouble sshing in?
[21:22:35] slaporte and I can't get into tools-login.wmflabs.org
[21:22:45] last night it was fine though
[21:25:06] mahmoudhashemi: Works for me. What is your shell account name?
[21:25:55] wooo, it's working again \o/
[21:26:17] o_O
[21:26:20] @replag
[21:26:21] Replication lag is approximately 06:38:51.2646710
[21:26:26] yep, it's back
[21:26:29] Coren ^
[21:26:40] replag isn't going to get any better until the network issue is fixed. viz. topic
[21:27:00] hm
[21:28:56] thanks!
[21:39:17] hm, my instance's home directory disappeared ...
[21:40:20] type ls ~
[21:40:24] it will be back
[21:40:40] :o
[21:41:47] thx petan :)
[21:51:37] Replication should be catching up now.
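(The two database failures in this log are easy to tell apart from the command line with the same replica.my.cnf invocation ChrisJ_WMDE quotes at 20:55:12: error 2003 means the replica host is unreachable, as in the network outage, while error 1045 means the login itself was rejected, e.g. because freshly created credentials have not propagated yet. A small sketch:)

```bash
#!/bin/bash
# Quick check of replica connectivity and credentials, using the same defaults
# file quoted at 20:55:12. It distinguishes the two failure modes seen in this
# conversation:
#   ERROR 2003 - can't reach the server (network problem)
#   ERROR 1045 - access denied (credentials missing or not yet propagated)

HOST="wikidatawiki.labsdb"   # any of the *.labsdb aliases works the same way

if output=$(mysql --defaults-file="$HOME/replica.my.cnf" -h "$HOST" -e 'SELECT 1;' 2>&1); then
    echo "$HOST is reachable and the credentials were accepted"
else
    # The mysql client prints the error number (2003, 1045, ...) on stderr.
    echo "connection failed: $output"
fi
```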
[21:56:44] @replag
[21:56:45] Replication lag is approximately 00:00:00.9317530
[21:56:52] @replag
[21:56:53] Replication lag is approximately 00:00:00.7259740
[21:58:06] @replag
[21:58:07] Replication lag is approximately 00:00:00.7688650
[21:58:25] petan: I don't think you can expect to go much below that. :-)
[21:58:33] :(
[21:58:59] I want 0.000001
[22:00:24] petan: That'd probably require you to break the speed of light, or at least causality. :-)
[22:00:54] bah
[22:04:34] Coren: technically, that's the same thing
[22:04:53] ;-)
[22:04:55] valhallasw: More precisely, one implies the other. :-)
[22:05:48] But in one direction only. If you break C then you definitely kicked causality out the window; but I can think of interesting ways to break causality that leave C alone. :-)
[22:13:31] This almost makes me want to study general relativity again.