[00:00:01] I don't think so, the grid scheduler should place it on a node where there is enough memory for all programs. That's why there is a memory limit, so that the server never runs out of memory [00:01:21] (if there is currently no server for you with enough free memory it will wait in the queue until there is) [00:20:42] Attempts to use catscan2 (http://tools.wmflabs.org/catscan2/catscan2.php ...) are throwing error 500 [00:27:43] sitic: thanks again, time to sleep a little bit :-) [00:45:48] Coren: you around? [02:32:39] Betacommand: Am now. [02:33:28] Coren: I got an email from cron: http://pastebin.com/4RwD6HDD [02:35:19] Odd. From what I can parse of the error message, it looks like it couldn't find an executable name in your command line at all. Perhaps an unclosed option that gobbled it? [02:36:07] Coren: it's the same command that's been running for months [02:36:15] o_O? [02:36:34] Coren: I've not changed my cron in at least two months [02:36:53] Hm... Only thing I can see is if the `which $exe` failed to execute for some external reason (OOM, etc). [02:37:16] Did you get that email just the once? What time? [02:38:16] 0000UTC [02:39:22] other jobs work fine [02:44:59] Coren: right now something seems to be crumbled with da webs. tools time out e.g. https://tools.wmflabs.org/guc/index.php?user=Nixred [02:51:46] Seems like the tools web frontends can't connect to databases. From console it works fine [02:53:32] Coren: and while we are at error reporting ... [02:54:42] /var/log/syslog stopped at 06:45 in pmtpa and eqiad with rsyslogd HUPed [02:55:45] and tools-login (pmtpa) The last Puppet run was at Sat Mar 1 03:22:30 UTC 2014 (2846 minutes ago) [02:58:57] forget about the /var/log/syslog thing .. [03:02:26] The puppet thing is not an error. [03:05:08] webserver-02 was OOM. [03:05:22] Is soooo pleased there is no shared apache in eqiad.
[03:06:12] yeah yeah yeah [03:07:16] Coren: another thing I can't explain: I can connect to replicas from eqiad with hostname: labsdb1003.eqiad.wmnet [03:07:18] hedonil: Let's just say that not everyone is as careful with resources as might be hoped in a shared infrastructure. [03:07:44] hedonil: That's normal. No DB replicas yet there. [03:08:06] But the natting and happy hostnames aren't yet in place. [03:08:35] Coren: well I /can/ make a connection, but I see different databases... [03:09:04] So you "can" connect, if you know the underlying hosts *and* ports -- but you really shouldn't because that mapping is not only not guaranteed to remain valid but will in fact explicitly change in a couple of days. [03:09:36] hedonil: That's because the sharding doesn't work the way you think it does, and is one of the reasons you really shouldn't be trying to use the infrastructure names. [03:10:42] Coren: a hacker hacks - for good. ;) in particular I can see /all/ databases, except my beloved one. but it still lives somewhere [03:11:04] hedonil: No, it's also there; you just haven't guessed the right host/port combo yet. :-) [03:11:18] ahh [03:11:23] hedonil: But quit it. If you write any code relying on that it *will* break. :-) [03:11:40] (We're renumbering the databases this week) [03:11:46] Coren: It was just for the lulz [03:14:56] Maybe I'll find it tuesday - back from the other dimension, alive and kickin' [03:22:47] ok, X's tools editcounter works now, seems on apache. newweb tools still suffer [03:51:20] Coren: ok. web things are back in track. thx [05:58:15] andrewbogott_afk, Coren: poke [05:58:29] I'm unable to log in to unicorn.wmflabs.org [06:00:02] i get this: [06:00:03] Creating directory '/home/bharris'. [06:00:03] Unable to create and initialize directory '/home/bharris'. [06:00:16] and then a bunch of other stuff about packages and restarts, and then it locks me out. [06:00:24] just shuts down the connection. [06:12:43] Perhaps migration-related? 
Though that isn't supposed to start until Tuesday: http://lists.wikimedia.org/pipermail/labs-l/2014-February/002152.html [06:12:55] Could just be NFS stupidity. [06:14:59] I'm not going to speculate; I'm going to raise the issue to those who will know. [06:15:13] You go girl. [06:43:52] Could someone clear the mysql connections for p50380g50489 on enwiki.labsdb? Currently can't login [08:01:30] Damianz: What specific login command are you trying and what's the error message? [12:34:52] Coren: scfc_de: tools webproxy is flapping and apaches serve 500 [12:35:25] hedonil: is this new from now? [12:35:26] Reasonator has problems as a result as well [12:36:23] matanya: I just saw it right now [12:40:53] I can't log into tools-webproxy, so I'll reboot it. [12:42:17] !log tools tools-webproxy: Rebooted; wasn't accessible by ssh. [12:42:20] Logged the message, Master [12:43:08] how long does a reboot take ? [12:43:15] tools-webproxy is back up, but tools.wmflabs.org still down. Let's see. [12:44:30] That's odd: "service apache2 restart" => "No apache MPM package installed". [12:46:13] Oh, I'm an idiot. I was on tools-login. [12:46:30] (But why is Apache installed there?) [12:47:00] is that not a secondary question [12:47:12] primary is what does it take to get things going again [12:47:57] GerardM-: Thanks for your helpful advice. [12:50:15] !log tools tools-webserver-03: Rebooted, scripts were timing out [12:50:16] Logged the message, Master [12:50:42] hedonil: Try some tools that were failing before? [12:51:08] sorry ... but given that the service of labs is rebooting this regularly, bot runs abort and restart without restart options [12:51:30] scfc_de: tried https://tools.wmflabs.org/xtools/pcount/index.php?lang=de&wiki=wikipedia&name=StefanServos [12:51:47] scfc_de: tried https://tools.wmflabs.org/wikiviewstats/?locale=de&type=daystats [12:51:56] I have found that replication does not work well and it takes a lot of time to get that debugged [12:52:16] .. 
so sorry again but frustration tells [12:54:44] !log tools tools-webserver-02: Rebooted, apache2/error.log told of OOM, though more than 1G free memory. [12:54:46] Logged the message, Master [12:54:56] GerardM-: What replication does not work? [12:55:22] what has given issues in the past is the replication of Wikidata [12:55:47] What was the error? [12:56:00] http://tools.wmflabs.org/reasonator/?&q=Q8860958 this should show results by now [12:56:50] It does for me?! [12:57:12] so you get a list with four five townships ? [12:57:44] i do not [13:00:04] No, I get "instance of Wikimedia:category page" and "is a list of township type ...". Is this outdated? [13:01:25] That seems to mirror https://www.wikidata.org/wiki/Q8860958 quite closely. [13:02:34] scfc_de: ok. seems to work now. thx. [13:03:50] scfc_de when there is replicated data, you will get to see a list [13:04:28] GerardM-: Can you provide an example? [13:04:50] http://tools.wmflabs.org/reasonator/?&q=6486661 shows 2518 items [13:08:00] http://tools.wmflabs.org/reasonator/?&q=111235 shows a number of townships that are in boone county [13:11:21] Is this tool maintained by Magnus? Have you reported this to him? [13:15:34] scfc_de: could you add User:Henning_Snater and User:Tobias_Gritschacher to tool labs please? :) [13:18:40] scfc_de as long as the environment is stable it works reliably [13:19:31] given that labs is not stable the question is very much how much is the restart properly automated (as is suggested by the use of tools like puppet) [13:19:48] that is where my frustration comes from [13:20:08] when the replication restarts properly, Reasonator will pick up the data [13:22:46] GerardM-: Have you reported that to Magnus? [13:23:15] regularly and as far as he is concerned the tool works well [13:23:21] it is the restart where the issue is [13:24:40] If Magnus as the tool author considers it working well, I as someone, who isn't familiar with it, won't disagree with him. 
[13:24:49] addshore: One moment, please. [13:24:57] :) cheers! [13:29:39] addshore: Done. [13:29:47] ty! :) [13:30:14] scfc_de: gremlin seems to be back. eg.. https://tools.wmflabs.org/guc/index.php?user=Bergfalke2 -> time out [13:40:11] quiad status -- Grid: Working -- Replicated DBs: Working -- Local DBs: Not yet -- Web: Not yet [13:40:19] eqiad* status -- Grid: Working -- Replicated DBs: Working -- Local DBs: Not yet -- Web: Not yet [13:41:29] For those who want to experiment on the eqiad labs: tools-login-eqiad.wmflabs.org [13:43:49] Coren: a bit slower than tampa for me [13:44:01] Coren what does that mean ? [13:44:05] ... "slower"? [13:44:42] GerardM-: Don't worry about it if you don't get it; this is like "pre-alpha preview". :-) [13:44:49] i type a command, and the shell return is slower than the same command on tampa [13:45:11] hedonil: "time out" = screen appears, but only a growing number of "IIIIIIIIIIIIII"s? [13:45:37] That's odd, because it's blindingly fast for me. Can you dpaste a traceroute to both tampa and ashburn from where you are? [13:45:38] scfc_de: yep. and after ages proxy timeout message [13:46:00] coren ... I want to understand what the status is of the replication of the Wikidata database on labs [13:46:18] as far as I can see it is either not working, or updating [13:47:38] scfc_de: and here no response at all https://tools.wmflabs.org/wikiviewstats/?lang=de&project=&page=Gravity_%28Film%29&datefrom=0000-00-00 [13:47:44] GerardM-: select rev_timestamp from revision order by rev_timestamp desc limit 1; [13:47:44] GerardM-: 20140303134706 <-- up to the second. [13:48:38] hedonil: The apaches are currently breaking because there are too many tools consuming too many resources. I'd add a webserver were it not for the fact that migration begins tomorrow so it's kinda pointless. [13:49:17] GerardM-: Replication is working fine, and has no lag. What issue do you see? 
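The replication check Coren pastes above (newest `rev_timestamp` compared against the current time) can be sketched in a few lines of Python. This is only an illustration, assuming the value is MediaWiki's 14-digit UTC timestamp (YYYYMMDDHHMMSS) as shown in the log; the function name `replication_lag_seconds` is hypothetical, and fetching the value from the replica is left out.

```python
from datetime import datetime, timezone

# Assumption: 14-digit UTC timestamp, e.g. "20140303134706" as pasted in the log.
MW_TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def replication_lag_seconds(newest_rev_timestamp, now=None):
    """Seconds between the newest replicated revision and `now` (UTC).

    Hypothetical helper: the caller is expected to have run something like
    `select rev_timestamp from revision order by rev_timestamp desc limit 1;`
    and to pass the resulting string in.
    """
    newest = datetime.strptime(newest_rev_timestamp, MW_TIMESTAMP_FORMAT)
    newest = newest.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - newest).total_seconds()


# Example with an explicit reference time instead of the wall clock.
reference = datetime(2014, 3, 3, 13, 47, 44, tzinfo=timezone.utc)
print(replication_lag_seconds("20140303134706", reference))
```

Note that this measures the time since the last replicated edit, so it is an upper bound on the lag and only tells you much on a wiki that is edited continuously.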
[13:50:03] this should return a list of townships http://tools.wmflabs.org/reasonator/?&q=Q8860958 [13:50:29] hedonil: Coren might be right about guc, wikiviewstats uses lighttpd/webservice so I don't know why that's failing. [13:50:44] GerardM-: Whatever issues there may be with the tool, it isn't replication. [13:50:58] Coren: ignore, i'm an idiot, it is the other way around, eqiad is 20 ms faster [13:50:59] this is data from a few days ago.. http://tools.wmflabs.org/reasonator/?&q=6486661 [13:51:05] it does show its items [13:51:13] it is tampa which is slower [13:52:57] !petan-build [13:52:58] make -j `getconf _NPROCESSORS_ONLN` deb-pkg LOCALVERSION=-custom [13:53:17] @replag [13:53:19] GerardM-: Perhaps reasonator has some internal processing done at interval and that hasn't been done in some time? The database itself is running happily with subsecond lag. I might be able to help debug if I knew what it looked for in the database; but as far as I can tell if there is an issue it's with the tool. [13:53:29] matanya: Ah. Makes more sense. :-) [13:53:33] scfc_de: I did a webservice restart, works now (for the next minutes or so ..) [13:54:00] sorry for the noise Coren [13:54:11] hedonil: Have you considered that the webservice might simply be running out of slots? [13:54:26] hedonil: And timing out because of this? [13:54:45] hedonil: (depending on what it does how) [13:55:37] Coren: I don't know. the issues started this morning. no changes in the application [13:56:26] hedonil: Is there an unusual level of activity on it? Some tools behave poorly when being crawled by a bot for instance. [13:56:54] Coren: not that I'm aware of [13:58:31] hedonil: Take a look at the access log? At worst, it'll give you an idea of what's going on. Also, I don't know what it does while it shows the progress bar; but it might be an ajax-y call timing out? What's in the error log? [14:00:26] Coren: there was a coincidence with recurrent apache OOM's and webproxy outages..
no paranormal activity in the log. [14:00:50] hedonil: If it's a webservice, it doesn't even get anywhere /near/ the apaches. [14:01:35] So unless that tool somehow interacts with another (like call an api, etc) that shouldn't be related. [14:03:34] Coren: by default it's a simple php app. but let's see if it happens again [14:04:37] Coren: another question: are there any recommendations for volume /public/backups in eqiad? naming conventions? [14:05:33] hedonil: That's going to be where the automated backups store their stuff; it's not intended to be (directly) writable by endusers. [14:07:43] woo! backups again :) [14:07:47] Coren: you mentioned a volume/dedicated place for custom db backups [14:08:12] Coren: or custom file backups [14:08:24] hedonil: Yes, but they're not done by doing them yourself; they're done by scheduling them. [14:09:34] That's still in flux anyways; that won't be done by migration. I don't want to add moving parts to an already complicated process. [14:10:01] Coren: so what to do if you want to have a custom (scheduled) backup? [14:13:19] if you are still planning a solution for that, I'll wait and keep it as is [14:57:34] * hedonil notifies that there are new replica credentials for tools accounts and grants for existing databases have to be changed by hand [15:00:40] Coren: are these new replica credentials for tools accounts additional or a scheduled replacement? [15:01:26] hedonil: Right now, they are additional; the old ones will remain valid until pmtpa is decommissioned. [15:01:46] ack [15:02:12] Yeah, one of the three unavoidable noticeable changes. [15:03:12] so let's put some load on the new grid ... [16:30:45] Hi, http://tools.wmflabs.org/ is responding Error 502 and many tools Error 503 or working partially [16:35:55] I'm getting proxy error when trying to load stuff... [16:36:06] is this "problem" known alreay?
[16:36:10] already* [16:36:15] Coren: ^ [16:36:35] Wiki13: this is where the ops in charge of labs hangout as well as most labs users :] [16:36:42] ah [16:36:54] operations is mostly for production wikis [16:37:22] didn't know, now I know :) [16:39:15] this is the log when doing a request: [17:37:04.325] GET http://tools.wmflabs.org/ [HTTP/1.1 502 Proxy Error 105205ms] [16:40:44] This means that webserver-03 is unresponsive. *argl* [16:41:16] :( [16:41:42] the first byte timeout should probably be set to a lower value so that it fails earlier [16:41:52] User:Luxo in this channel? [16:42:57] reasonator works, autolist answers queries with claims from wikidata but not lists from wikipedia categories, catscan returns 500 [16:44:43] !log tools tools-webserver-03: Apache was swamped by request for /guc. "webservice start" for that, and pkill -HUP -u local-guc. [16:44:44] Logged the message, Master [17:18:20] Why isn't guc using the webservice anyways? [17:18:31] No matter; there is no apache default in eqiad. [17:27:58] !log deployment-prep doing an Elasticsearch reindex on beta before I try another one in production [17:28:00] Logged the message, Master [17:38:48] Coren: I can't log into unicorn.wmflabs.org anymore. The ssh connects and I get "Unable to create and initialize directory '/home/bharris'." and then it closes the connection. [17:39:03] I'm about to get on the train, though, so I can't debug until later. [17:39:25] jorm: Level of emergency? Because I'm in a severe crunch for tomorrow's migration. [17:50:22] Coren: medium. I was hoping to get some user tests running today. [17:50:39] I'll try to take a look at it then. [17:54:55] can someone create me a home directory on integration-selenium-driver.pmtpa.wmflabs? [17:55:06] I have access to the machine but no home directory in the project, I believe [17:55:13] my ssh connections are getting thrown on the floor [17:55:47] manybubbles: you're the second to report missing home directory today [17:56:00] sorry!
I don't think I ever had it there [17:56:21] no reason to be sorry, just telling you jorm had similar issue [17:56:26] and i think Coren is looking [17:56:36] might be general issue with mounting the /home's [17:56:54] and usually those get created automatically [17:56:58] Worst. Timing. Ever. [17:57:14] It's gluster, obviously. [17:57:20] nod [17:57:39] What project is this? [17:58:49] manybubbles: [17:59:01] beta? integration? [17:59:09] Coren: me, uh, I'm not sure? [17:59:15] whatever one runs jenkins [17:59:28] :p [17:59:43] don't you have to be project member [17:59:48] to connect to that [18:00:09] tried to search for instance name, but not yet [18:00:17] That's integration. Lemme check [18:00:53] Yeah, broken gluster. [18:02:01] Volume restarted. [18:02:28] how do you do that? init script? [18:06:06] coren: did you get my question re jsub environment variable? [18:06:40] giftpflanze: Not that I recall. Remind me it? [18:06:55] mutante: No, you have to go on the gluster bricks, kill processes, then forcibly restart the volume. [18:07:08] jsub environment variable actually says it all [18:07:19] Says nothing to me. [18:07:29] Coren: it let me in! thanks so much! [18:07:39] i wanted to ask if you could add one to your script [18:07:58] so that it read from it [18:08:00] giftpflanze: I probably can. Definitely not today, or likely this week. Please file a BZ. [18:08:11] *reads [18:08:28] Coren: gotcha, thx [18:08:29] so, you approve that? [18:09:25] giftpflanze: I don't disaprove of it, and I have 0% brain cycles to spend on the matter atm. :-) [18:09:37] ok ^^ [18:09:49] yay, w/e [18:09:53] Sorry about being short, but migration is tomorrow. :-) [18:10:21] i see :) [18:27:41] Should the files on https://wikitech.wikimedia.org/wiki/Special:NewFiles not have a license like on all other wiki projects? 
[18:29:54] MGA73: yes, and i think they are cc-by-sa [18:30:07] note how the footer links to https://wikitech.wikimedia.org/w/skins/common/images/cc-by-sa.png [18:34:57] Yeah but on other project we normally put a template on the files :-) [18:46:57] it seems like I can not use the commonswiki.image table... is this intended? [18:51:40] Just set up a new tool lab project. Any time I try to load a PHP file over the web, I just get a 500 internal server error: http://tools.wmflabs.org/svg-map-maker/test.php [18:52:28] kaldari: can I see the code? [18:52:38] [18:53:20] If I wanted to run a number of independent operations on certain edits in the recent changes feed, is there any "good" way of doing so? [18:53:27] kaldari: weird, I'm using php on my projects [18:53:29] fale: any PHP file give the 500 error, HTML pages work fine though [18:53:54] petan: ^ [18:54:06] hi [18:54:22] any idea why I can't get PHP to work in a new project on tool labs? [18:54:25] kaldari: did you check your error log? are the files owned by tool? [18:54:40] files MUST be owned by tool [18:54:45] petan: ah, that's probably it. I forgot to do that! [18:54:56] ok [18:55:09] I wish there was better error than just 500 [18:55:17] but I didn't implement this... [18:55:28] it's very cryptic error [18:55:34] petan: any idea why I can not use the table commonswiki.image? [18:55:50] no :( I have absolutely no powers over database [18:56:06] petan: hmm, still doesn't seem to work, and no error log :( [18:56:12] petan: oh :o. Who has power over the db? [18:56:20] kaldari: tell me which file [18:56:22] fale: Coren|Busy [18:56:22] maybe I need to take the whole dir... [18:56:34] petan: thanks :) [18:56:43] kaldari: just tell me the file name I check it [18:56:50] full path if possible :P [18:57:15] hmm, looks like all the ownership is fine... [18:57:42] I can't check the error without file name [18:57:42] data/project/svg-map-maker/public_html/test.php [19:00:08] fale: Have you tried commonswiki_p? 
("_p") [19:01:21] kaldari: unfortunately it's hosted on webserver-01 and for unknown reasons I can't ssh there [19:01:32] maybe the box is broken or Coren|Busy is doing something nasty there [19:01:54] kaldari: if you want I can switch it to another webserver? [19:02:11] petan: sure [19:02:18] if it's not a hassle [19:03:06] scfc_de: same error :( [19:03:12] !log tools petrb: switched local-svg-map-maker to webserver-02 because 01 is not accessible to me, hence I can't debug that [19:03:14] Logged the message, Master [19:03:17] petan: I think you need to reboot tools-webserver-01, it probably has lost /public/keys [19:03:25] fale: What's the exact query? [19:03:55] scfc_de: that isn't anything that requires reboot, anyone with root key (which Coren|Busy just don't want to give out to other admins for whatever reason) can fix it [19:03:59] I told him that many times [19:04:10] scfc_de: select * from commonswiki.image limit 1; [19:04:31] you can easily set up multiple keys to root, you don't need to strictly use only the one shared on all of labs [19:04:47] scfc_de: it seems like I was connected to the wrong server [19:04:49] but he always say "no you don't need direct access to root account you just don't need it" [19:04:56] so I can't fix it. [19:05:07] petan: There's a difference between being right and getting things done with the tools at hand :-) [19:05:14] petan: I am most certainly not going to give you root to WMF operations; and also I am not going to spend time debugging resources problems on a webserver that will live all of two weeks. [19:05:23] Coren|Busy: I never asked for that [19:05:36] Coren|Busy: I suggested to upload more public keys to authorized_keys file [19:05:44] fale: To which server are you connecting? commonswiki.labsdb? [19:05:47] scfc_de: I mean, I was connected to itwiki.labsdb instead of commons.labsdb... this make me wondering... does this means I can not join between the itwiki db and the commons one? 
[19:05:50] Coren|Busy: that isn't giving any access to any other box that these few [19:05:56] * than [19:06:29] fale: You can, but it's not documented at the moment. On itwiki.labsdb, you can join against commonswiki_f_p (or was it commonswiki_p_f? One of them). [19:06:39] kaldari: [error] [client 10.4.1.89] (12)Cannot allocate memory: couldn't create child process: /usr/lib/suphp/suphp for /data/project/svg-map-maker/public_html/test.php [19:06:40] petan: Doesn't work this way, and also I don't have time to argue about it. [19:06:45] scfc_de: cool, thanks :) [19:07:15] !log tools petrb: restarting apache on webserver-02 it complains about OOM but the server has more than 1.5g memory free [19:07:16] Logged the message, Master [19:07:39] kaldari: fixed! [19:07:44] scfc_de: commonswiki_p_f is denied and commonswiki_f_p is hanging... :D [19:07:46] kaldari: your web works again, including many others [19:07:57] petan: Yay! How'd you fix it? [19:08:14] kaldari: I restarted apache server... :P nothing really sophisticated [19:08:26] it was kinda borked heh [19:08:28] petan: genius! [19:08:51] fale: You need to do a JOIN on the primary key of commonswiki_f_p.image; without that, it would be slow beyond belief :-). [19:09:01] scfc_de: I see :) thanks [19:14:05] petan, Coren|Busy: 500 error is back :( http://tools.wmflabs.org/svg-map-maker/test.php [19:14:48] hold on [19:15:11] now it's getting horrible segfaults [19:15:14] :o [19:15:24] Internal Server Error on tools ... [19:15:26] ? [19:15:35] yes [19:15:36] for at least a few hours [19:15:40] I can see [19:15:48] !log tools rebooting webserver-01 which is totally dead [19:15:51] Logged the message, Master [19:17:50] !log tools petrb: upgrading all packages on webserver-02 [19:17:51] Logged the message, Master [19:17:56] petan: Remember to restore the iptables; bit me several times. [19:18:16] scfc_de: any documentation for that?
[19:18:18] !toolsadmin [19:18:18] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Documentation/Admin [19:19:00] scfc_de: this query is hanging: http://pastebin.com/uFikeSQ5 ... Probably something is wrong about that table :( [19:19:19] petan: Dunno; "sudo iptables-restore /data/project/.system/iptables.conf" [19:19:21] wow, scfc_de, webserver-02 is seriously fucked up [19:19:28] scfc_de: I am getting failed to fork on almost every command [19:19:37] scfc_de aha ok [19:20:03] !log tools petrb: shutting down almost all services on webserver-02 in order to make system useable and finish upgrade [19:20:04] Logged the message, Master [19:20:53] ok, it works again :o [19:21:03] fale: Looking into it. [19:21:08] scfc_de: thanks :) [19:21:32] wtf [19:21:41] I can't even ssh to tools-login nor tools-dev [19:21:45] getting permission denied [19:22:05] scfc_de: does new ssh login to any of these works to you? direct, not through bastion [19:22:23] works again [19:22:25] o.O [19:27:17] fale: itwiki_p.image has 126509 rows, so I expect this might take a while. [19:28:13] scfc_de: is the commonswiki_f_p db real or is something using the lan? [19:34:03] fale: In MySQL, it's called "federated tables". That means the MySQL server at itwiki.labsdb connects to commonswiki.labsdb and on access to the commonswiki_f_p table queries the original over the network. So "real", but over the network. [19:34:30] scfc_de: I see, thankyou :) [19:35:14] scfc_de: I still have the impression that something is not working properly, but I'll wait more minutes to see if something happens [19:36:36] speaking of root : https://bugzilla.wikimedia.org/show_bug.cgi?id=62132 root owns my homedir ;-) [19:37:26] zz_yuvipanda: seems like the gerrit<->github bridge has some issues with the pull-requests ;) [19:38:47] MGA73: Yours is apparently an instance of https://bugzilla.wikimedia.org/show_bug.cgi?id=61899; let me fix that for you. [19:39:29] scfc_de: it worked... 
in "only" 12 minutes :D [19:39:35] thank you [19:39:42] MGA73: Could you try logging in? May fail, just first step. [19:39:47] scfc_de: thanks a lot :) [19:41:50] scfc_de: Logged in but the start message says that -bash: /home/mga73/.bash_profile: Permission denied [19:42:06] kaldari: should be fixed now? [19:42:22] yep, thanks! [19:43:27] MGA73: And try again, please? [19:45:20] scfc_de: No error message this time :-) Will see if I can set up things now [19:50:09] scfc_de: Thank you... it works now... Except for a crappy user-config.py I copy-pasted [20:08:40] eqiad status -- Grid: Working -- Replicated DBs: Working -- Local DBs: Not yet -- Web: Working [20:09:22] Coren|Busy: is it possible to create instances on eqiad now? [20:09:35] petan: Instances? Not for users, no. [20:09:44] mhm [20:09:54] @notify andrewbogott_afk [20:09:55] This user is now online in #wikimedia-labs. I'll let you know when they show some activity (talk, etc.) [20:10:05] he said it's gonna be possible this week or something [20:11:07] Yes, he did. Tuesday. Read the email. :-) [20:19:03] from gerrit server: ! [remote rejected] master -> master (invalid committer) [20:19:29] what is that? I was trying to merge a pull-request that came from outside gerrit [20:28:48] hey lovely labs folks [20:28:57] terrrydactyl just joined the wikimetrics project [20:29:01] and she needs loginviashell [20:29:10] could someone please give her that? [20:29:47] fale: I think (and have no clue :-)) that you need to amend the committer to your account (i.e. author of the commit remains the original, but committer becomes you). Could you try "git commit --amend --no-edit" and see if that changes the committer to yourself? [20:29:59] milimetric: Moment, please. [20:30:41] milimetric: 8ohit.dua? [20:30:57] scfc_de: I'm not sure what you mean [20:31:03] what's 8ohit.dua? [20:31:24] A user on wikitech. What's terrrydactyl's wikitech username?
[20:31:59] i was going to say please https://wikitech.wikimedia.org/wiki/Special:FormEdit/Shell_Access_Request [20:32:04] oh, it's terrrydactyl [20:32:09] scfc_de: ^ [20:32:25] same as her IRC [20:32:27] scfc_de: thanks :) [20:33:51] milimetric: Done. [20:34:08] thank you so much scfc_de [20:44:09] * Damianz pokes scfc_de [20:51:12] hedonil: Stress test? [20:51:33] Coren|Busy: ? [20:51:40] http://tools-eqiad.wmflabs.org/?status [20:52:05] Coren|Busy: heel yeah ;) [20:52:18] Coren works fine [20:52:38] Coren|Busy: 30% faster than pmtpa [20:52:55] Moar power!!1!! [20:54:00] Coren|Busy: moar - no problem. [20:54:22] That wasn't intended to be a challenge. :-) [20:54:33] Coren|Busy: hehe :) [20:55:31] Coren|Busy: but for some hours I can raise my own job limit from 8 to hmmm 20 [20:55:43] eqiad status -- Grid: Working -- Replicated DBs: Working -- Local DBs: In progress -- Web: Working [20:56:22] Coren|Busy: Does that mean that tools that don't rely on tools-db can already migrate, or wait for your official okay? [20:56:49] zz_yuvipanda: if you can close the https://github.com/wikimedia/labs-tools-lists/pull/3 it would be great :) thank you :) [20:57:05] scfc_de: If you can cope with the lack of documentation and helper scripts, you're welcome to migrate now. [20:57:37] raised da limit - I'll put it back if other jobs appear [20:57:49] scfc_de: Indeed, if you find caveats or gotchas and note them, that'd be useful. [20:59:59] Coren|Busy: 3 questions, feel free to re when not busy - a) can you flush connections for p50380g50489 on enwiki.labsdb - cbng has been down all day b) can we proxy tools.wmf to tools-eqiad (even if the answer is do it in lighttpd - yay to proxy->proxy->proxy->proxy) c) any chance on getting 2fa taken off my wikitech account so I can reboot/move things. [21:01:39] a) yes; b) that's on my 'would be nice' list when everything else works; c) tomorrow?
[21:02:34] Works for me :) [21:02:53] Coren|Busy: Is there something like https://wikitech.wikimedia.org/wiki/Labs_Eqiad_Migration_Progress for tools ? [21:03:24] hedonil: That's on my TODO for today, once I'm done with the actual workings. [21:03:31] hedonil: and wrote the helper scripts. [21:03:49] Can your helper scripts remind me in what order I need to compile things? :D [21:04:00] Damianz: Probably not. :-) [21:04:38] Coren|Busy: you're the manager, make a blank wikipage and let the folks do something constructive [21:06:38] https://wikitech.wikimedia.org/wiki/Tool_Labs/Migration_to_eqiad [21:07:01] Coren|Busy: :) :) [21:07:36] ERROR 2003 (HY000): Can't connect to MySQL server on 'enwiki.labsdb' (111) < that actually got worse [21:07:49] * Damianz goes back to finding ice cream while wikipedians spam him [21:07:58] kill user p50380g50489; [21:07:58] ERROR 2013 (HY000): Lost connection to MySQL server during query [21:08:14] I think you borked the server [21:08:19] It would seem that the DB didn't like my getting rid of those queries. [21:08:29] yay it's alive [21:08:43] tyvm cor...n [21:08:45] Nope, the server self-borked. [21:10:25] Hmm... is it behind in replication? seems to be returning a lack of data for current edits [21:10:34] though the bot might just be grumpy [21:10:57] oh it died again [21:11:12] * Damianz will come back later after Coren|Busy gives the server lots of love [22:09:33] Have anyone tried commonscat.py recently? [22:11:22] It's possible. [22:18:03] MGA73: is it broken? [22:18:43] it does not work for me but my bot is really fucked up so it may be something in my installation [22:19:13] can it be caused by labsdb issue? [22:20:20] Fails silently? [22:21:53] I'm trying to run it on the page 2031 on da-wiki and instead of removing the commonscat template (the category on commons is deleted) it gives a lot of errors [22:23:09] perhaps it is just trying to tell me it is time to go to bed :-D [22:25:42] ... what labsdb issue? 
[22:26:10] Coren|Busy: 14:16 < brassratgirl> actually it seems all of the labs database might be down, which I'm guessing someone already knows about :) [22:26:23] on #wikimedia-tech, i said to make bug anyways [22:26:32] but that it might be because you're working on it [22:26:38] since i saw mysql changes [22:27:09] works for me [22:27:12] Hm, no. All of the labs db seem up and healthy; the changes I'm making are for the new one in eqiad. [22:27:48] s/all of the labs db/the random sample of the labs db I just did/ [22:28:04] invites brassratgirl to this channel [22:29:10] I still think enwiki is a little 'dodgy' but has survived for like an hour now hmm [22:30:27] hey all [22:30:48] * Damianz waves a large clown like hand in the vague direction of brassratgirl [22:30:58] eek [22:31:20] stopping by b/c I am trying to run some queries with catscan2 & am getting mysql errors [22:31:24] I got waved over here. [22:31:57] Ah. The bug seems to be in catscan2; as far as I can tell there is nothing wrong with any of the databases. [22:32:38] OK, thanks! I will report it. [22:34:26] Coren|Busy: catscan2 is giving me "Could not connect to enwiki.labsdb : Can't connect to MySQL server on 'enwiki.labsdb' (4)" [22:34:39] it's defo up atm [22:34:43] and quick intersection, which is a separate but related tool, was giving a longer mysql error [22:34:46] * Damianz looks at bot happily making edits [22:35:11] so I wasn't sure. OK. Thanks. [22:38:06] Hmm [22:38:43] I do agree that catscan2 returns 'Could not connect to enwiki.labsdb : Can't connect to MySQL server on 'enwiki.labsdb' (4)'... but it's definitely running [22:39:01] * Damianz wonders if dns is a little wonky on a server somewhere...
[22:39:51] quick_intersection is giving me "Warning: mysqli::mysqli(): (HY000/2003): Can't connect to MySQL server on 'enwiki.labsdb' (110) in /data/project/magnustools/public_html/php/common.php on line 88 Fatal error: Call to a member function real_escape_string() on a non-object in /data/project/magnustools/public_html/php/common.php on line 101 " [22:42:10] totally can't log in to the webservers to check and it's not running on grid... [22:42:37] Coren|Busy or scfc_de or andrewbogott_afk maybe can check if there's something funky on those [22:44:20] brassratgirl: where is that running [22:45:07] http://tools.wmflabs.org/catscan2/catscan2.php & http://tools.wmflabs.org/catscan2/quick_intersection.php, respectively. [22:45:20] catscan2 tools-webserver-02 [22:45:45] thanks damianz :) [22:46:10] lua-catscan2 tools-webserver-03 < interested in what that is [22:52:13] Coren|Busy: here's my constructive suggestion https://wikitech.wikimedia.org/wiki/Tool_Labs/Migration_to_eqiad [22:54:07] 622, nice [22:55:09] Some of them probably don't need to be moved... can see at least one that was from a tools proxy issues edition of fixing bots [22:57:35] Ah, I see what the issue is with catscan. Someone rebooted a webserver but didn't restart the natting to the databases. [22:58:09] That should fix it. [22:58:46] oh my [22:58:56] Coren|Busy - sweet! Thanks! [22:59:25] (I'm trying to come up with a list of articles for an edit-a-thon tomorrow, so was feeling a bit rushed!) [22:59:31] lazy: dump into /etc/rc.local i suppose [23:01:11] mutante: Not worth the effort. pmtpa tool labs is going away. [23:01:25] Coren|Busy: Very certain when I rebooted tools-webserver-02 I ran iptables afterwards. [23:01:57] scfc_de: That wasn't on -02; so not your fault. Nobody's trying to assign blame. :-) [23:02:16] It might very well have been me; there were a few OOM early this morning.
[23:02:59] Coren|Busy: makes sense, yea [23:03:45] I rebooted it in the afternoon (UTC, 12:54Z); not for blaming, just to prevent further mishaps. The problem with webserver-* seems to be that they complain about OOM, yet there's lots of free memory. [23:04:13] No apaches in eqiad. [23:05:09] So new mysteries to solve :-). [23:05:46] scfc_de: Nope; the lighttpd setup is well known. :-) [23:06:58] Still prefer nginx ;) [23:08:15] Damianz: for running one web server per tool, lighttpd is much much more lightweight. [23:08:59] http://www.fefe.de/fnord/ that is really lightweight:) [23:09:05] Small! (13k static Linux-x86 binary without CGI, 18k with CGI) [23:09:06] haha [23:09:14] There is that... though considering it's all proxied anyway half the time I'd rather just have stuff pointed to gunicorn and skip the proxy proxy [23:09:54] o.0 that's like smaller than mongoose and we embed mongoose into switches [23:10:02] http://www.fefe.de/fnord/others.html *g* [23:11:05] that's not really a scientific comparison [23:11:29] this guy such brilliant [23:11:32] Heh. I still want something that speaks proper FCGI, can do things like indexes and rewrites, etc. :-) [23:11:39] what, "found buffer overflow with first grep" isn't scientific?:) [23:11:50] lighttpd: works. [23:11:56] giftpflanze: fefe? haha, like his blog, eh? [23:12:06] not really [23:12:10] hrhr [23:12:57] I wondered why he was writing in broken english... until I realised chrome had translated german [23:13:34] hedonil: https://wikitech.wikimedia.org/wiki/Tool_Labs/Migration_to_eqiad - another form to fill?! [23:13:41] Actually... since we should have new proxy by s...i forgot the nick, are we getting rid of apache entirely and using generic labs proxy solution (tm)? [23:14:46] scfc_de: is there already another? [23:15:10] * hedonil likes forms to document things [23:15:47] mass-create bugzilla?
hmmm [23:16:13] hedonil: No; it just reminded me of https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/List_of_Toolserver_Tools which people never really bothered with :-). [23:16:36] Mediawiki tables are hard... no one likes wiki tables [23:16:57] scfc_de: hmm toolserver was great, but had bad management [23:16:59] it's one bug for tool [23:17:01] per [23:18:00] mutante: And who cleans up the mess of the maintainers who don't care about Bugzilla? :-) [23:18:21] Empty tool directory & Co. = moved. [23:18:41] scfc_de: andre_, but pssst :) [23:20:39] toolserver has been flaky for years though I'm not yet convinced labs has much more priority in being fixed when broken, if not less... though with more money. which in some sense is understandable, but others just frustrating hmm [23:24:40] Damianz: labs has at least two full-time employees whose primary responsibility is keeping it up and happy [23:27:27] Whose first responsibility is seemingly and correctly production when it all goes to shit and tools also (kinda) did ;) Don't get me wrong, the right idea but omg people required maybe... [23:40:25] Coren|Busy: well, technically tool labs has one full time [23:40:55] ... well, true, though in practice that's the /only/ thing Andrew has been able to do lately. :-) [23:41:01] and you've been investing a lot of time into production stuff lately [23:41:21] andrew has been doing stuff with labs migration and not tools, right? [23:41:25] tools != labs [23:41:41] Ryan_Lane: Damianz: labs has at least two full-time employees whose primary responsibility is keeping it up and happy. [23:41:46] I didn't say "Tool Labs". :-) [23:42:08] yes, but Damianz was specifically talking about the stability of tools [23:42:10] Can we please not call it tool labs :P [23:42:16] which is different than labs [23:42:25] agreed. tools is much better [23:42:31] Ryan_Lane: Yes, though entirely dependent on it. And yes, 'tools'. :-) [23:42:39] tools == I don't care, it should just work.
labs == I'm going to break production [23:42:54] tools == when it breaks lots of people spam my twitter/email/phone all day [23:42:59] * Coren|Busy still thinks nobody should let geeks -- let alone sysadmins -- name anything. :-) [23:43:07] tools is actually fine [23:43:10] it's the project name [23:43:45] anyway, the stability of tools is dependent on the stability of labs, but it's also dependent on its own stability [23:43:58] and tools itself only has one person assigned [23:44:36] (labs as a whole has historically been under-staffed) [23:44:39] I'm still sad the boss decided 'Panda' was a silly name for a GUI and named it 'Squig'. Naming things is hard [23:44:51] (and its scope continuously expands) [23:44:58] "i don't care, it should just work" applies to $everything:) [23:45:22] yep. part of actually making things work is having people work on it [23:45:30] and alas, that hasn't seemed to have changed [23:45:31] eqiad status -- Grid: Working -- Replicated DBs: Working -- Local DBs: Working -- Web: Working [23:45:39] Coren|Busy: JFI: tools-exec-07 is reported with only 2 NCPU (instead of 4) in qhost and therefore currently has a load of 117% [23:46:07] For reference to an interesting point - how long has the kit been in eqiad to move to? I know it's been talked about for a looooong time [23:46:20] hedonil: How fun. It really /does/ only have two cores. I effed up when I set it up. [23:47:56] * Coren|Busy changes the actual availability numbers so that -07 won't get overcommitted [23:48:30] Damianz: until relatively recently it hasn't been very actively worked on [23:48:41] I think work began in late november/early december [23:49:04] and really the work didn't gear up until january [23:49:09] Coren|Busy: I could kill my jobs on -07 [23:49:58] hedonil: CPU usage a bit over 100% isn't an issue as a rule of thumb, especially since your jobs are CPU bound. [23:50:31] Ah - I thought there was stuff waiting to be configured for a while...
but I can't keep track of what's been in for test then stolen for other projects [23:50:32] Coren|Busy: I mean if you want to replace it with another image before the storm begins [23:51:56] hedonil: *shrug* It doesn't actually matter; it's half the size and will get half the jobs; since it's virtualized hardware on static images, there isn't any real difference between resizing that one and adding another. [23:51:58] Damianz: well, part of the setback was me leaving [23:52:47] hedonil: This close to opening the floodgates, I'd rather concentrate on writing the documentation than mess with it. :-) [23:53:00] Damianz: we had some other issues as well. in this specific case none of it was related to manpower being stolen for other projects [23:53:19] that said, manpower has been very often stolen from labs [23:53:38] we've hired people for labs to have them taken one week before they started for something completely unrelated [23:53:46] and never get them back [23:56:07] Mhm [23:56:23] it's amazing we have a working product ;) [23:56:32] indeed [23:56:51] and that en.wp is still up :p [23:57:03] I still can't believe it's like 3 years... kinda remember back before Cor.n and andrew in the randomly unstable days of labs... which still worked pretty well [23:57:10] what is the "epmd" daemon doing [23:57:16] to me that is a rap group [23:57:16] especially considering that production always takes priority and priority of stuff in labs itself is often meddled with by managers [23:57:31] and scope creep is standard operating procedure [23:57:45] Though I kinda also think it's a shame it took nearly 2 years to get replicas and stuff and hasn't really finished some of the 'basics' really.... so much more could be done [23:57:47] Damianz: ah, the bad ole days of gluster [23:58:09] oh gluster... [23:58:21] it's still being used. I just got it to a point of relative stability [23:58:43] There's only one thing worse than gluster...
upgrading gluster [23:58:52] Damianz: well, the lack of replicas and such was mostly that it took so long to hire someone for tools [23:59:04] mostly because a position wasn't opened for ages [23:59:47] Coren|Busy knocked out the work related to tools in relative speed, once he was on board