[00:00:16] hey andrewbogott
[00:00:25] do you know how to create a testing project on wikitech-test.wmflabs.org?
[00:01:49] andrewbogott: https://www.josephjewelry.com/mens-wedding-rings/1364.php?page=2
[00:02:09] Not the jeweler we are using; but that's the model we are taking.
[00:04:39] That looks awesome! Very stylish. Is it so heavy that you won't be able to lift your hand?
[00:05:54] Krenair: Are you a cloud admin?
[00:06:06] Now that you mention it, I think that maybe project creation is broken for me too :( Which wiki are you using atm?
[00:06:59] andrewbogott, I could probably make my account become one if I wanted. I am a member of labs' openstack project so can ssh to nova-precise2
[00:07:21] andrewbogott, https://wikitech-test.wmflabs.org/wiki/Main_Page
[00:08:02] Krenair: the easiest thing to do is just pick a project on wikitech-test that was copied over from production and mess with that one.
[00:08:08] Or I can make you a new one, or you can cloud-admin yourself.
[00:08:15] (presuming that things are working, which, I'm unsure.)
[00:08:38] I'm not sure how I'd copy one over
[00:08:49] andrewbogott: Well, it's denser than gold so I wouldn't want a bracelet but I think I can cope. :-)
[00:08:53] * Coren reads the patch now.
[00:09:03] Krenair: They're all copied over already, is what I mean.
[00:09:12] Copied but empty.
[00:09:32] (But, more importantly, it's harder than steel, which means it'll last on my hands more than weeks before it looks like crap) :-P
[00:10:07] Coren, does that also mean that if you end up in an ER with a freak accident they'll have to cut the bone vs. the ring?
[00:10:16] Maybe ERs have special tools for that kind of thing these days :)
[00:10:48] Coren, andrewbogott: you got some time today/tomorrow to test the labs-migration-assistant?
[00:10:54] Heh. I doubt it; anything that could deform that ring I expect would have puréed the hand first. :-)
[00:11:27] (Also, titanium carbide has basically zero ductility, so it'd break before it'd deform)
[00:11:30] drdee: Yep, probably later today
[00:11:38] Coren: OK, that's reassuring :)
[00:13:07] andrewbogott: cool, please have a look at the shell commands as well because that determines whether the checks are actually meaningful. particularly the check for using /data/projects feels flaky
[00:13:46] also: is there a URL that gives a list of all the labs instances that I have access to?
[00:14:14] right now the script is only analyzing hosts that you are an administrator of, but that might be too restrictive
[00:15:17] drdee: I think if you go to 'manage instances' and fuss with the filter widget at the top you can select every project you belong to, and that will ultimately show you all instances in all those projects.
[00:15:50] right, but that's hard to do from a cli ;)
[00:16:10] oh...
[00:16:36] I think there's no good way w/out running on a production host.
[00:17:51] ty Coren
[00:38:32] Coren, can anything be done to slim down the labs-db instance on virt11?
[00:38:35] I'm guessing not :(
[00:41:24] andrewbogott, ugh, I need 2-factor auth to access novainstance on wikitech-test now?
[00:41:45] Krenair: that's been true since almost day one… it should work just fine.
[00:41:56] Presuming you have a smartphone :/
[00:42:16] yeah, ok
[00:46:55] andrewbogott, okay, how do I create an instance again?
[00:47:10] On 'manage instances' there should be a link
[00:47:14] If you're a project admin
[00:47:18] Rather, a link per project
[00:47:59] why is manage instances not linked on the sidebar? :/
[00:48:34] it… is?
[00:48:43] Twiddle the 'Labs Projectadmins' widget
[00:48:50] oh ok, have to be a project admin.
[00:48:54] ok now I got it
[01:50:35] gwicke: I'm going to move parsoid-spof right now...
[01:56:12] …or not, something is broken :(
[02:00:18] Hm… the labs gui on wikitech is currently down. I'm working on it...
[02:02:08] ok, fixed
[02:24:12] andrewbogott: tools-db? Not a chance; it's already straining under its size.
[02:24:34] Coren: Alas
[02:24:38] andrewbogott: Tool labs users are eagerly awaiting the move to eqiad, and that's one of the reasons. :-)
[02:26:12] It's using about 1/4 of the storage on virt11… but probably if I move it to a different host a million volunteers will cry out in anguish. I'll move some smaller ones instead.
[02:27:00] andrewbogott: It's gotten bad enough that I have to regularly clean up binlogs just to keep up.
[02:27:18] * Coren also cannot wait for tools-db on real hardware.
[02:39:50] andrewbogott: Just out of curiosity, are you moving from virt11 to virt10?
[02:40:02] Just now I moved something from 5 to 12
[02:40:05] Because virt10 just popped up an alert about being at 3%
[02:40:12] Dammit
[02:40:20] Anyway, 12 still has lots of room
[02:40:34] But, yeah, ryan fixed the last crunch on 11 by moving to 10.
[02:41:20] Lesson learned: Don't overprovision so much in eqiad :(
[02:42:59] !log testlabs deleting instance asher-m1 because asher is gone and we need the space
[02:43:01] Logged the message, dummy
[02:51:09] !log visualeditor moving parsoid-spof and ve-roundtrip to virt12. Brief downtime and reboot will ensue
[02:51:10] Logged the message, dummy
[03:10:51] drdee: Re querying from the CLI, you could use Semantic MediaWiki's API with "[[Member::User:User Name]]" to get a list of all projects a user is a member of (don't know how to filter for projectadmin), and then use "[[Resource Type::instance]] [[Project::Project Name]]" to get a list of the individual projects' instances. (I use a similar query to refresh my ~/.dsh/group/tools & Co.)
[03:11:33] sweet, do you have a code example?
[03:12:34] Coren: Are you still working this evening? Since Ryan_Lane isn't about I'm going to deploy the servicegroup code...
[03:12:37] https://wikitech.wikimedia.org/w/api.php?action=ask&query=%5B%5BResource%20Type%3A%3Ainstance%5D%5D%5B%5BProject%3A%3Atools%5D%5D%7C%3FInstance%20Name&format=json
[03:12:40] wouldn't mind a hand in testing/verifying
[03:15:30] welp
[03:22:36] andrewbogott: I'm around, though otherwise distracted.
[03:22:58] Coren: Question one, has loading https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup for tools always been incredibly slow?
[03:23:15] Hm, maybe wikitech is freaking out :(
[03:24:00] andrewbogott: Yeah, it's getting increasingly so. I think we have something sublinear going on with service and project groups.
[03:24:24] https://wikitech.wikimedia.org/wiki/Special:NovaDomain is snappy, so is non-openstack stuff.
[03:24:52] You wouldn't say that it's way slower than earlier today?
[03:25:03] Nova* is.
[03:25:08] I don't know, really, just nervous since that's the bit I just changed.
[03:25:21] hm
[03:25:22] I haven't gotten an answer from any of them yet.
[03:25:39] Well, lemme roll back this patch and see if it matters
[03:25:51] Special:Nova{Project,ServiceGroup,Resources}
[03:27:47] Yep, the new patch is what's slowing it down.
[03:28:01] I wonder, is doing an ldap query for a nonexistent entry slower than for an existing one?
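For reference, Coren's 'ask' query above can be scripted directly. A minimal sketch: it uses the exact URL pasted at [03:12:37], and assumes curl and jq are available and that the results come back keyed by page name under query.results (the usual Semantic MediaWiki 'ask' response shape — verify against your SMW version).

```bash
# List all instances in the 'tools' project via wikitech's Semantic MediaWiki
# API; jq only pulls the page names out of the JSON envelope.
curl -s 'https://wikitech.wikimedia.org/w/api.php?action=ask&query=%5B%5BResource%20Type%3A%3Ainstance%5D%5D%5B%5BProject%3A%3Atools%5D%5D%7C%3FInstance%20Name&format=json' \
  | jq -r '.query.results | keys[]'
```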
[03:28:39] In theory this should only be making 2x as many queries, but it is much more than 2x slower
[03:28:54] it might make sense to move the parsoid rt testing off labs at some point
[03:29:11] and anyway why would it make non-servicegroup stuff slower?
[03:29:28] we are trying to catch perf regressions, but with all the uneven load balancing etc the data is pretty worthless
[03:30:26] gwicke: if it's useful to have e.g. all your instances hosted on the same virt host, I can arrange that.
[03:30:29] But not just now :)
[03:31:26] andrewbogott, we are using most of labs cpu time already when doing an rt run
[03:31:39] that would only slow us down and push one machine to an extremely high load
[03:32:31] true
[03:32:33] in theory we have 36 cores or so busy with rt testing
[03:35:05] Coren: OK, reverting that patch for now. Maybe need to do some ldap indexing or something… anyway, I won't be deploying that tonight.
[04:00:58] andrewbogott: Wait, why are you doing two lookups anyways? Wouldn't you just write to the new entry but not read it?
[04:01:44] The second lookup is in there for data integrity reasons… maybe not needed.
[04:02:40] Might be worth a try to deploy without; we do have a script to sync up after all; and if it works normally without we know there's a missing index.
[04:02:57] (Or at least that it's the new lookup that slows things down)
[04:04:15] Yeah, I'll go back and make sure the code still makes sense without the second lookup.
[04:04:32] (The second lookup is, for instance, to see if there's a record before e.g. adding members.)
[04:15:56] Krenair, working on nova-precise2 atm?
[04:17:32] Krenair: I may have just clobbered your pending patch there… feel free to replace
[04:17:38] But, let me know...
[04:48:21] Coren, does every wiki have a jobqueue?
[04:48:40] I'm looking at a bit of unfamiliar OSM code that creates a job and submits it… I can't tell if it's ever actually running.
[04:48:44] andrewbogott: Necessarily so; most of the category management and links are managed that way.
[04:49:14] Is a 'Job' in mediawiki php the thing that the jobqueue runs?
[04:49:17] Or are they unrelated?
[04:49:30] andrewbogott: But on a low-traffic wiki that queue can pile up something fierce.
[04:49:39] Yes it is.
[04:49:43] How can I assess the status of the queue?
[04:50:27] maintenance/showJobs.php
[04:50:43] also maintenance/runJobs.php to drain the queue by doing all the work.
[04:51:07] showJobs says '0'
[04:51:23] So nothing piled up, at least. That's a good thing. :-)
[04:51:35] Sort of, except it doesn't explain why this code isn't executing
[04:51:55] I know the sysadmin side of mediawiki pretty well, not so much the innards.
[04:52:55] But I'd expect that unless the job submission caused an exception, if the job queue is empty then jobs have been run.
[04:53:13] I think it must be happening, and the logging is just broken
[04:53:23] Due to the indirection of the queue and reliance on a global to find the logfile
[04:53:35] * andrewbogott guesses
[04:54:19] Anyway I need lunch before I can think about this anymore.
[04:54:21] Thanks for your help!
[05:45:05] Coren: here's another mw question… how can I get a list of pages edited in the last few minutes?
[05:46:46] nm, found it
[06:23:30] heya, I can't ping bastion1.pmtpa.wmflabs, dig doesn't cough up an IP. ebernhardson was having problems ssh'ing to our labs host earlier.
[06:24:49] spage, ok, looking...
[06:25:56] duh, that's my alias. `dig bastion.wmflabs.org` works, but not pingable
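As a rough sketch of the queue inspection andrewbogott and Coren walk through above — run from the wiki's MediaWiki root directory; exact output wording varies by MediaWiki version:

```bash
# Count the jobs currently queued; '0' means nothing is waiting, as seen above.
php maintenance/showJobs.php

# Drain the queue by running every pending job in the foreground.
php maintenance/runJobs.php
```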
[06:26:14] ping just kicked in, yo
[06:27:14] ok, so, working?
[06:27:23] I was about to say, it's bastion.wmflabs.org, not bastion1.wmflabs.org
[06:28:08] andrewbogott: thanks for looking into it.
[06:28:14] 'k
[06:28:15] ssh -v -v says ssh_exchange_identification: read: Connection reset by peer
[06:28:36] hm… wait, I'm lost. Working or not working?
[06:28:52] I mean `ssh -v -v -v bastion.wmflabs.org` doesn't work. Sorry, let me get my mouse :)
[06:30:43] !log bastion rebooting bastion1
[06:30:44] Logged the message, dummy
[06:31:40] spage: reboot fixed it. Probably was OOM
[06:33:00] yup fixed, thanks. I saw ErikB's report and tried it myself. back to bed :)
[06:59:55] I don't seem to have any replica.my.cnf file... how can I get one?
[07:44:51] andrewbogott: it was my dns code change that slowed things down?
[07:45:09] Ryan_Lane: Nope, service group/ldap stuff
[07:45:21] Your changes are in, with a few modifications.
[07:45:25] cool
[07:45:38] what does the service group stuff do?
[07:46:00] btw, want to see something neat? :) http://trebuchet.wmflabs.org/
[07:46:17] Ooh, a gui!
[07:46:25] yeah. just for reporting for now
[07:46:31] The service group stuff is… transition from local-groupname to projectname.groupname
[07:46:41] moving everything under ou=servicegroups
[07:46:59] So, a bunch of things in a new ou with no indexes… I'm guessing that's the issue but haven't investigated yet
[07:47:26] ahhhh, ok
[07:48:01] well, maybe. we'll need to see if those attributes have existence indexes set
[07:48:46] Has maintenance/puppetValues.php worked any time lately?
[07:49:04] haven't tried :)
[07:49:10] 'k
[07:49:12] I guess I should have tested that when I did the dns changes
[07:49:23] I think it gets the instance, so it should work
[07:49:31] Well, it's definitely broken /now/ :) I just wonder if it was broken before...
[07:49:32] I checked for references to the functions I changed
[07:49:34] oh
[07:49:38] heh. damn
[08:39:22] I don't seem to have any replica.my.cnf file in my instance (find / -name replica.my.cnf = zero result)... how can I get one?
[08:39:47] I-000005b0.pmtpa.wmflabs
[08:43:05] Nicolas_: Sorry, I don't know what a replica.my.cnf is. Why would you expect your instance to have one?
[08:43:40] Nicolas_: that's a toollabs-only thing. You don't get it in labs instances by default
[08:44:03] That would've been my next guess :)
[08:44:04] andrewbogott: replica.my.cnf holds the credentials for a service group on tools for access to the replica mysql
[08:44:21] * andrewbogott nods
[08:47:51] Thanks! I just want to execute a read-only SQL query, and the documentation at https://wikitech.wikimedia.org/wiki/Nova_Resource%3aTools/Help#Database_access tells me I need this file
[08:50:18] Is there a way to run such SQL queries without having to download dumps? (Wikidata by the way)
[08:54:14] I'm pretty sure that you can do this from your project, but I don't know how :( Best to try again when there are more people about, or email the labs list.
[08:57:24] Nicolas_: why aren't you just using tools?
[08:57:25] !log moving math-semantics to a new virt host to avoid a storage crunch. This will reboot the instance.
[08:57:25] moving is not a valid project.
[08:57:43] !log math moving math-semantics to a new virt host to avoid a storage crunch. This will reboot the instance.
[08:57:47] Logged the message, dummy
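For Nicolas_'s read-only Wikidata query, the Tool Labs flow looks roughly like this once run from a tools host; the <dbname>.labsdb host alias and the _p database suffix follow the wikitech help page cited above, but treat the exact names as an assumption to verify.

```bash
# replica.my.cnf is generated per tool/user on Tool Labs and holds the replica
# credentials; as noted above, plain labs instances don't get one.
mysql --defaults-file="$HOME/replica.my.cnf" -h wikidatawiki.labsdb wikidatawiki_p \
      -e 'SELECT COUNT(*) FROM page;'
```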
[09:02:43] !log search moving/rebooting search-test
[09:02:44] Logged the message, dummy
[12:05:44] hi Coren
[12:05:53] one of my tasks is stuck in state=dr?
[12:05:56] I qdel'ed it
[12:06:11] 2188529 0.25529 updatedyk local-liange dr 01/15/2014 07:00:16 task@tools-exec-09.pmtpa.wmfla 1
[12:06:31] how come we have requests for projects from 2012 :)
[12:06:51] https://wikitech.wikimedia.org/wiki/New_Project_Request/Titleblacklist
[12:07:03] 27 December 2012, completed no
[12:09:06] New Project Request/Extension:BlockandNuke, New Project Request/Global Talkpage Notify, New Project Request/wikidataquery
[12:09:44] should there be a "won't fix"/rejected or something?
[12:53:27] hey guys, i can't configure any existing instance in my project anymore
[12:53:29] The requested host does not exist.
[12:53:50] but only on "configure", besides that they are up and running, i can ssh to them, they are shown as active and so on
[12:54:11] https://bugzilla.wikimedia.org/show_bug.cgi?id=60167 but i just realized it happens on all instances i try, not just the new one
[12:54:12] aren't you the person who people would normally ask this kind of question? :P
[12:54:43] petan: nah, i'm not involved in the migration
[12:54:54] i figure it might be related to a change there
[12:55:26] petan: can you still click "configure" on an instance of yours without that error?
[12:55:35] you don't need to change anything, just the link
[12:57:34] let me check
[12:59:43] web interfaces really are slow... <3 terminals
[12:59:53] if there was a terminal interface for labs I would have already checked this
[13:00:22] yes I am getting the same error
[13:00:27] @labs-instance bots-labs
[13:00:33] @labs-instance-info bots-labs
[13:00:39] thanks, if you feel like confirming on that bug above, would be nice
[13:00:41] !ping
[13:00:41] !pong
[13:00:54] confirming?? what is that
[13:01:02] I didn't know we have such a thing in bugzilla :D
[13:01:05] saying that you see the same error on another project
[13:01:11] ah that
[13:01:24] because next they will ask if it's just one project or a global issue
[13:01:29] I thought a bug must be confirmed now for it to be opened :D
[13:01:49] hehe, no :) but when you resolve it it needs "verified" to be really resolved :)
[14:59:35] Coren: https://gerrit.wikimedia.org/r/#/c/98307/ ?
[15:03:12] paravoid: Working on a fix atm, but I'll be with you after.
[15:04:28] right, I see
[15:06:29] ok… is it just the puppet config that doesn't work on wikitech, or are there other issues?
[15:07:37] coren, mutante?
[16:02:27] paravoid: All done. Need me to do that manual merge?
[16:08:45] * andrewbogott is Zzzzzzzzzzzzz
[16:11:22] paravoid: I've committed a merge.
[16:11:48] <^d> G'morning Coren.
[16:12:35] Heyo, ^d. Be back soon, needz tea!
[16:12:39] i am trying to connect to tools-exec-07.pmtpa.wmflabs but receive a "Connection closed by UNKNOWN" error, can someone else reproduce?
[16:12:51] <^d> Coren: Enjoy your tea :)
[16:13:19] Ima fetch it, then be back.
[16:22:58] * Coren now has tea.
[16:23:42] drdee: Works for me.
[16:24:08] dr0ptp4kt: But you /have/ to log in through -login though. The exec nodes use HBA
[16:24:20] drdee: ^^
[16:24:28] St00pid autocomplete.
[16:26:27] thanks Coren :)
[16:34:07] !ping
[16:34:07] !pong
[16:34:11] ok
[16:34:25] <^d> Coren: So yeah, reason for the ping was re: LDAP. Wanted to know if we had any change since yesterday
[16:35:23] ^d: No. Ryan_Lane should appear shortly though and I'll corner him. Normally, I'd just restart the ldap but I don't know if they're doing something on virt1000 atm.
[16:35:32] <^d> Okie dokie
[16:36:47] Coren: ?
[16:37:02] liangent: !
[16:37:05] my job 2188529 has been stuck in state=dr for several hours
[16:37:46] I qdel'ed it, and it has already run much longer than expected
[16:38:09] I'll go check what it's doing.
[16:39:30] It's long dead; it looks like gridengine failed to notice. How dumb.
[16:40:41] yeah, it's expected to finish in minutes, but that task has been shown as running for days, blocking the following task runs
[16:40:48] (I'm using -once)
[16:41:08] liangent: Gone.
[16:41:27] liangent: Pro tip: qdel -f can help in situations like this.
[16:41:33] Coren: ok will it happen again?
[16:41:40] But don't hesitate to poke me if you're unsure.
[16:42:41] liangent: It shouldn't; that was caused by the exec node crashing and being restarted and gridengine failing to clean up after itself. By definition, it's a freak occurrence.
[16:44:25] Also, I note there is a gridengine setting I can use to be more forcible automatically in cases like this. I'll turn it on in eqiad.
[16:46:48] eqiad?
[16:47:08] I know that's a cluster but I don't know the layout of the grid
[16:47:21] (03PS1) 10Diederik: Discovered that Semantic Mediawiki has an API that I can query! [labs/migration-assistant] - 10https://gerrit.wikimedia.org/r/108063
[16:47:23] also queues seem to look like *.pmtpa
[16:47:47] (03CR) 10Diederik: [C: 032 V: 032] "Ok." [labs/migration-assistant] - 10https://gerrit.wikimedia.org/r/108063 (owner: 10Diederik)
[16:50:55] (03PS1) 10Diederik: Update documentation. [labs/migration-assistant] - 10https://gerrit.wikimedia.org/r/108064
[16:51:17] (03CR) 10Diederik: [C: 032 V: 032] "OK." [labs/migration-assistant] - 10https://gerrit.wikimedia.org/r/108064 (owner: 10Diederik)
[16:53:43] liangent: pmtpa and eqiad are our current data centers. The openstack/labs setup is in pmtpa; we're moving to eqiad soon (moar power!)
[16:54:20] More resources, bigger filesystems, and next to the database replicas as opposed to 26ms away.
[17:53:16] Coren: Could you send a short note to labs-l?
[17:54:22] scfc_de: Am composing it now.
[17:56:03] !petan-build
[17:56:03] make -j `getconf _NPROCESSORS_ONLN` deb-pkg LOCALVERSION=-custom
[18:19:13] labs all better now Coren? https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#IRC_feeds_down.3F
[18:20:07] ShoeMaker: No. As explained on labs-l, there is a workaround in place that solves some of the issue but the fibers are still broken and that affects communication with eqiad.
[18:21:04] Would you please note that on the linked discussion quickly if you could? I'd do it myself if I had a better knowledge of what was going on.
[18:21:47] Coren: Nothing has come across labs-l for me
[18:22:42] Betacommand: Sent ~17:57 UTC; I got it through labs-l.
[18:22:51] I'm apparently not on that list... I only get wikitech-l and wikitech-ambassadors
[18:23:08] Betacommand: http://lists.wikimedia.org/pipermail/labs-l/2014-January/002014.html
[18:24:35] tl;dr: someone played with a backhoe. No ETA for full return of connectivity, but there is a workaround so that external communication with pmtpa is restored.
[18:25:30] Thanks marc for the note on VPT
[18:27:56] I signed up for labs-l too while I was at it...
[18:31:31] Coren: An additional impacted system is that labs is unable to access production wikis. So all tool labs bots are effectively down, even if they don't use the replicas.
[18:45:45] Is labs supposed to be all working now? sql commonswiki_p doesn't seem to be working for me
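The stuck-job cleanup liangent and Coren discuss earlier in this chunk, spelled out as commands (2188529 is the job id from the log; 'dr' marks a job the scheduler is trying, and failing, to delete):

```bash
qstat -u "$USER"   # list your jobs; a 'dr' state means deletion is pending
qdel 2188529       # the normal delete, which hung here after the exec node crashed
qdel -f 2188529    # Coren's pro tip: force gridengine to forget the job
```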
[18:46:20] anomie: Oh, hum. Yes, you're correct that they would; the default routes would all try to hit eqiad directly and not go outside.
[18:47:18] bawolff: No, it's not. The link between DCs is still broken even if Tampa is reachable from the outside. No ETA yet, as we have to wait for the people with backhoes and fiber repair gear to do their part.
[18:47:35] * bawolff suggests the topic be changed :)
[18:47:59] * Hasteur endorses this viewpoint.
[18:57:22] My bots seem to be online, reading IRC, but unable to read wiki pages??
[18:57:47] Beetstra: That seems to be consistent with the current network breakage.
[18:57:57] Beetstra: major outage
[18:57:58] OK, just checking
[18:58:05] Yeah, I know, I am following it
[18:58:52] I just noticed that LiWa3 is failing to read its settings, so fails to come on irc (well, it does not know where to go .. )
[19:12:09] hi petan
[19:12:18] the tool asks me to blame you
[19:12:19] liangent@tools-dev:~$ sql meta_p
[19:12:19] This is unknown db to me, if you don't like that, blame petan on freenode
[19:12:29] see https://bugzilla.wikimedia.org/show_bug.cgi?id=48626#c8
[19:13:42] liangent: the db is unreachable due to the fiber cut
[19:13:52] i think
[19:15:03] mutante: Correct. It lives in eqiad
[19:15:21] The good news is, the broken fiber(s) have been located and crews are on their way.
[19:16:19] so at most blame it for saying "unknown" instead of "can't connect"
[19:17:20] mutante: really? I think the cause is that meta_p is not listed in /etc/hosts or some other database list
[19:17:42] because `sql enwiki` produces a connection failure instead of an unknown-db error
[19:18:23] well we can wait for the fiber to be back and check the sql command again
[19:18:33] liangent: sorry, didn't look at details, but connection errors wouldn't be surprising right now
[19:18:37] i'd give it an hour or 2
[19:20:01] Wait, meta_p?
[19:20:07] That should be metawiki_p
[19:20:39] liangent: ^^
[19:21:25] Coren: huh but it's mentioned as meta_p everywhere in that bug
[19:21:31] Ctrl+F for metawiki returns nothing
[19:22:02] petan: you're a huggle dev, right?
[19:22:06] Oooo. Okay. Ignore me. I misunderstood what you were trying to say. :-)
[19:22:24] liangent: There isn't /a/ meta_p database; that database exists on every shard.
[19:22:54] btw, Coren: is there some role i can give to my vm so that i automatically get updated versions of /etc/hosts, ipchains config, and little helper scripts like /usr/bin/sql, like in labs?
[19:23:09] From: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#IRC_feeds_down.3F --> Gotcha, thanks for the update. For anyone else reading this that uses Huggle, change your options to force the software to use the API queries rather than the IRC feed. System > Options > uncheck "Use IRC feed for recent changes if possible". Cheers! — MusikAnimal talk 1:38 pm, Today (UTC−5)
[19:23:48] Why are people having to untoggle this? If the IRC feed is not available, shouldn't Huggle already know that and not force users to work around it by changing settings?
[19:24:24] JohannesK_WMDE: Not readily, but you could create a role for it by basing yourself off toollabs::exec_environ.
[19:24:57] Coren: how do i start with creating a role? i'm going to have to do that anyway at some point...
[19:25:28] Coren: and do you think it's my responsibility to get a dblist and choose an arbitrary db to connect to, then "use meta_p" to query meta_p, or your responsibility to add meta.labsdb to /etc/hosts, or petan's to handle it in the sql command?
[19:25:28] ShoeMaker: I expect it's automated if the irc feed itself is down, but can't really cope with or notice that nothing's happening.
[19:25:52] Shouldn't it?
[19:25:58] ShoeMaker: IRC just came back up
[19:26:16] Shouldn't it have a timeout and default to pulling from the API if it fails to get anything?
[19:26:16] liangent: That's an "interesting" question. I'm not sure how useful command-line access to those DBs really is though.
[19:26:40] ShoeMaker: How would it know the difference between "nothing happens" and "something happens but you don't see it"?
[19:27:03] If it's not getting any data, then it should time out and fail and default to the API.
[19:27:46] ShoeMaker: That might be reasonable most of the time for enwiki, but many smaller projects have longer periods between changes that'd make that unreliable at best.
[19:28:02] ShoeMaker: But I'm not the maintainer; they're probably the better ones to ask. :-)
[19:28:23] lol That's why I was pinging petan as I remember him working on it. :p
[19:28:47] Coren: well, direct access to meta_p by tools is useful too, e.g. to provide a wiki list for users to select. though there's no task for petan here
[19:30:24] I guess some tool authors would simply choose enwiki to do such queries, like "aawiki" used in various maintenance scripts. but I don't really think this is an elegant solution
[19:33:47] JohannesK_WMDE: I'm working on exactly that role at https://gerrit.wikimedia.org/r/#/c/107010/.
[19:34:03] Damianz, you run ClueBot right?
[19:34:12] (Doesn't work yet in any way whatsoever; just some notes written down.)
[19:35:09] scfc_de: nice, i'm keeping an eye on that
[19:35:49] liangent: If you want to be really nice about it, you can connect to any random shard, but just picking one is okay. Anything else will have the same end result anyways. Which shard should 'sql' use?
[19:36:55] JohannesK_WMDE: Your best bet is to base yourself off a role that does something similar, and study it. Check the operations/puppet repo out and look in modules/toollabs for instance.
[19:38:45] Coren: I don't know, but there's not even a list of all shards, so I can pick one programmatically..
[19:38:52] *can't
[19:39:09] liangent: s[1-7]
[19:39:55] liangent: That's kinda documented all over the place, but it's clearly not readily findable and should be mentioned in the tool labs doc.
[19:41:56] Coren: well, not in this aspect. when I first learned these things there were just s1 to s3
[19:42:07] I imagine there'll be an s8 in the future
[19:42:37] liangent: Probably yes, but picking from s[1-7] would remain a safe bet regardless. :-)
[19:42:59] liangent: You're investing *way* too much time in the question of how to access that table :-).
[19:43:02] But honestly, I don't know of a "right" solution to that conundrum. :-)
[19:43:45] Hum, so, fiber cut. No wonder my bot is stuck :-)
[19:44:31] Yeah, apparently someone screwed up and didn't call digsafe or digsafe screwed up or someone called and then ignored the writing on the ground or.... Anyways...
[19:44:33] scfc_de: because I can't do what I want to do now ... to really access it :(
[19:51:16] Technical_13: Apparently (at least) three people screwed up :-). But a bit more monitoring would probably have helped.
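Putting Coren's advice together: since meta_p exists on every shard, a tool can just pick one of s[1-7] and query it there. A sketch under those assumptions — the s1.labsdb host follows the shard naming above, and the wiki table's column names are taken from the labs replica docs as commonly described, so verify before relying on them:

```bash
# s1 is an arbitrary pick from s[1-7]; any shard should serve meta_p.
mysql --defaults-file="$HOME/replica.my.cnf" -h s1.labsdb meta_p \
      -e 'SELECT dbname, url FROM wiki LIMIT 5;'
```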
[19:53:05] @info
[19:53:05] http://bots.wmflabs.org/~wm-bot/dump/%23wikimedia-labs.htm
[19:53:45] !bang
[19:53:45] Bang!!
[19:53:48] ^^
[19:53:50] xD
[19:54:56] sounds like a wiki-vacation for people who use the output of bot runs...
[19:55:53] I'll just run my bot locally for the time being :D
[20:10:11] scfc_de: Or one set of people screwed up three times. Or screwed up BIG. :-)
[20:11:57] Coren: Speaking of which: Are all three of these fibers in the same place, or "coincidentally" did three fibers get cut in three different places all at once, or is someone at the fiber company so bad at their job that they didn't notice the first two disappeared until the third one got cut too?
[20:12:32] anomie: I don't know. The information we got is partial at best. I think the latter.
[20:13:07] * anomie is confused as to how to interpret "the latter" in a list of three items
[20:14:31] Coren: I think that's the inherent problem of all things remote: You can't confirm whether there were really three accidents caused by "force majeure", or if an admin at the DC stumbled over the power cable.
[20:16:52] anomie: the cuts were in different places: orlando, sarasota, tampa
[20:17:03] hi, does anybody know why I cannot connect to the internet from labs any more?
[20:17:32] PHP Fatal error: Uncaught exception 'Exception' with message 'Failed to connect to 2620:0:861:ed1a::1: Network is unreachable'
[20:17:45] sitic: Were they all at about the same time? Or did someone just not notice until all three eventually got cut?
[20:17:59] anomie: I don't know
[20:20:46] benestar: I can connect for example from tools-login to www.yahoo.com.
[20:21:10] Hmm... is this not the second time in a year the fiber got cut ^.^
[20:21:29] benestar: Direct connections between the datacenters are still down
[20:21:45] well, how long will it take to fix?
[20:22:12] Depends on how long it takes the fiber people to go out and fix the cuts.
[20:29:21] YuviPanda: Have you tried getting grrrit back up?
[20:29:28] is labs back up?
[20:29:33] In a way
[20:29:34] Let me try
[20:29:41] marktraceur: it's a continuous job, it should just come back when other infra is back
[20:29:52] !log restarted grrrit-wm in hopes it'll reconnect
[20:29:55] marktraceur: also remember that it requires pmtpa to access eqiad - for the connection to gerrit
[20:30:08] Huh, maybe it won't work then
[20:30:14] marktraceur: yeah.
[20:30:19] Ah well
[20:30:22] marktraceur: gerrit-to-redis will be down, I suppose
[20:30:25] Yeah
[20:30:31] marktraceur: might as well kill it, so people don't wonder why it's there but... not working
[20:30:38] At least we can stare at him lovingly now
[20:31:31] YuviPanda: How do I kill it?
[20:31:45] marktraceur: jstop gerrit-wm?
[20:31:46] or
[20:31:50] jstop lolrrit-wm?
[20:31:56] marktraceur: ctrl-r stop?
[20:32:02] Yeah
[20:32:04] jstop
[20:35:28] marktraceur: sweet :)
[20:37:22] bastion2 overloaded?
[20:37:43] or something
[20:45:14] Hi! Anyone here?
[20:45:21] I think that X!'s tools are down
[20:45:30] !help
[20:45:30] !documentation for labs !wm-bot for bot
[20:45:33] nope.
[20:45:38] not specifically
[20:45:46] Technical_13: You sure?
[20:45:47] Read the topic STATUS
[20:45:54] Oh
[20:46:19] it's not an issue with X!'s tools, it's a fiber issue affecting much of labs.
[20:46:43] k thanks
[21:06:31] We have gotten no new ETA from the people who own the fibres.
[21:06:48] <^d> Dear lord this is annoying :\
[21:06:53] what is?
[21:07:55] <^d> Well, just bemoaning the issue and lack of an ETA.
[21:08:03] <^d> I guess it's not actively annoying me.
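For context on the jstop exchange above: grrrit-wm runs as a continuous grid job, so its lifecycle looks roughly like this. The script path and job name here are illustrative, not the real tool's layout:

```bash
jstart -N lolrrit-wm ~/bin/run-bot.sh   # submit as a self-restarting continuous job
qstat                                   # check whether it is actually running
jstop lolrrit-wm                        # stop it cleanly, as marktraceur does here
```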
[21:08:45] ^d: Well, it's frustrating but understandable; until they actually have people trying to splice glass and get photons flowing, it's really hard to estimate how bad the damage is.
[21:09:23] <^d> Yeah. Just hate when things are out of your hands and there's nothing you can do but wait patiently.
[21:09:32] <^d> In case you haven't noticed, I'm not a patient person ;-)
[21:15:42] err: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
[21:15:50] on a new instance
[21:15:58] is it related to the outage?
[21:16:20] hmpf http://tools.wmflabs.org/catscan2/catscan2.php
[21:16:34] MaxSem: Usually it goes away after a few Puppet runs.
[21:16:40] (IIRC.)
[21:28:42] <^d> Ryan_Lane: You aroundddd?
[21:36:40] is there a temporary workaround to reach gerrit from labs?
[21:37:04] <^d> Ouch, hadn't thought of that.
[21:37:21] <^d> Couldn't that publicly route though?
[21:38:05] ^d, you mean gerrit?
[21:38:11] <^d> Yeah
[21:38:29] I know too little about the way the routing is set up to tell
[21:38:37] <^d> Likewise.
[21:38:49] I can ping bast1001
[21:38:58] could try some tunneling
[22:27:03] If anyone is interested: Connections from Labs to replicas and Wikipedia are back.
[22:27:58] \o/
[22:28:03] O_O
[22:31:03] but my connection to labs died
[22:31:07] :/
[22:31:10] same here
[22:31:15] nothing works now from my end
[22:32:06] *Argl* It's down again.
[22:32:16] lol
[22:32:23] kinda up-ish here
[22:32:38] Damianz: From Labs to en.wikipedia.org?
[22:33:09] Nah
[22:33:29] DB access is working dandy though; bots moved on from no reverts due to scoring to no reverts due to API access
[22:33:47] ok, NL -> labs seems up again
[22:35:25] Now replicas work for me, but en.wikipedia.org still times out. Probably someone pressing two fibers manually against each other :-).
[22:36:10] and no connection to labs for me again...
[22:37:19] Someone is reading the light out of the end of one bit of fiber and shining a flashlight down the other end vaguely in time
[22:56:34] Oh, my bot is back to work!
[23:04:13] * aude cheers :)
[23:17:34] Coren: I lost my backscroll… wikitech working OK today?
[23:20:19] :-)
[23:21:15] uhoh
[23:21:40] * Damianz pets andrewbogott
[23:21:59] So I take it some disaster ensued while I was asleep?
[23:22:48] Someone let the rabbits loose in the datacenter
[23:38:18] andrewbogott: The link between tool labs and the DB was hosed. Many bot herders were understandably concerned.
[23:38:58] Hasteur: Fixed now? And any idea as to the cause?
[23:40:41] Ah, here at last is the outage report; I just hadn't made it that far in my email
[23:40:47] i'm making some tweaks to hhvm, and compiling it takes >1hr on my laptop. Is there any problem with me using a labs instance already booted for my team to compile?
[23:41:01] * ebernhardson has a 1.8GHz dual-core ULV laptop
[23:42:29] ebernhardson: As long as it's not on toollabs that's totally fine.
[23:42:40] It may not be any faster than your laptop though
[23:43:40] the instance claims 4 cores, which should have a lot more oomph than my ULV processor, but yea i suppose i'll just find out
[23:53:01] looks like i should build from /tmp, to not include glusterfs? is there a slightly less likely to disappear location i can use that's still local disk?
[23:53:49] ebernhardson: You're root? /mnt should be free.
[23:54:47] scfc_de: that'll do, thanks
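On MaxSem's certificate error: scfc_de is right that it usually clears up on its own after a few Puppet runs. If it persists, one common recovery is to discard the agent's stale keypair and re-request a certificate. This is a sketch of the generic Puppet procedure, not a documented labs-specific one; the path below is the Puppet 2.x/3.x default.

```bash
sudo rm -rf /var/lib/puppet/ssl   # drop the agent's mismatched certificate/key
sudo puppet agent --test          # re-run; the agent requests a fresh certificate
```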