[00:27:55] !log deployment-prep deployment-mediawiki01 & deployment-mediawiki02: configured for project-local puppet & salt masters [00:27:57] Logged the message, Master [00:40:25] bd808: what do i need to do to make those two instances trebuchet-deployable? i migrated both (thanks to your instructions) to the project puppetmaster and salt master [00:40:31] but /srv/deployment isn't getting created [00:40:59] hmm… it should happen when puppet runs [00:42:31] # grain-ensure contains deployment_target scap ; echo $? [00:42:31] 0 [00:42:38] so that much is working, at least [00:43:51] ori: Have you run though the things at -- https://wikitech.wikimedia.org/wiki/Trebuchet#Troubleshooting [00:44:22] bd808: i haven't; good call. [00:45:01] I think that puppet makes /srv/deployment but I may be misremembering [00:45:47] it didn't [00:45:51] "salt-call deploy.fetch 'scap/scap'" triggered it [00:46:11] i suppose that means that if i had done a deployment of scap it would have successfully created the dir [00:47:17] puppet should have triggered that though. There's something similar to `salt-call deploy.deployment_server_init` to ensure that each host it up to date [00:47:32] I don't have the repo handy to grep right now [00:48:31] You need to add the new hosts to this file too -- https://github.com/wikimedia/operations-puppet/blob/production/modules/beta/files/dsh/group/mediawiki-installation [00:49:18] Without that, scap won't actually send them anything [01:01:51] bd808: aha, thanks [01:02:15] ori: It was bound to be your next question :) [01:02:34] bd808: actually i didn't run into it because i ran 'sync-common' on each host instead [01:02:35] which did work [01:02:55] Oh yeah. pull will work even when push doesn't [03:06:35] 3Tool Labs tools / 3[other]: Migrate http://toolserver.org/~dispenser/* to Tool Labs - 10https://bugzilla.wikimedia.org/66868#c3 (10This, that and the other) 5UNCO>3RESO/WON Closing this as wontfix. The tools were on Labs briefly but were removed due to ongoing issues. I don't think Dispenser is going to... [04:04:28] :( [04:20:29] YuviPanda|zzzzzz: ok, flask-mwoauth 0.1.34 includes your logout?next= change -- thanks for contributing :-) [08:57:13] Is a line like this correct for a python script run in toolsA? conn = MySQLdb.connect(host='commonswiki.labsdb', db='commonswiki_p', read_default_file='~/replica.my.cnf') [09:03:17] no [09:03:29] read_default_file=os.path.expanduser('~/replica.my,cnf') [09:03:53] Nemo_bis: ^ [09:04:05] er, my.cnf* obviously [09:17:07] Thanks, changed. Maybe connection was working; the script OOM'ed. [09:17:43] also, you should really use oursql :) [09:20:32] But my lazyness in tweaking scripts I run can be seen in https://github.com/nemobis/wikiteam/commit/eb9677fb13fb9e782776603c83c480a826c76605 [09:22:30] !log deployment-prep rebooting deployment-apache01. [09:22:32] Logged the message, Master [09:22:41] you should be using the subprocess module for that [09:25:19] !log deployment-prep rebooting deployment-apache02 [09:25:21] Logged the message, Master [09:26:24] Sure, I did elsewhere [09:27:02] legoktm: subprocess is for wimps. use sh :P [09:27:07] the module, that is [09:27:24] YuviPanda: subprocess doesn't require additional dependencies! [09:27:29] pffft [09:27:32] borrringgg [10:17:59] !log deployment-prep setting up nutcracker on deployment-bastion. It was installed but the puppet class to configure it was not being applied. 
Related Gerrit patches: {{gerrit|148041}} and {{gerrit|148042}} [10:18:02] Logged the message, Master [10:19:53] !log deployment-prep deleted /var/lib/apt/lists/lock on bastion. Was prevent apt-get update from running [10:19:55] Logged the message, Master [10:20:30] !log deployment-prep upgrading packages on deployment-bastion [10:20:32] Logged the message, Master [10:22:29] Anyone has idea about "Could not find/open font (FreeSans) (width calc)" while importing templates to beta. [10:22:34] oh, hi hashar :) [10:25:43] !log deployment-prep on bastion, fixed some puppet dependency to have nutcracker to start with the proper configuration {{gerrit|148043}} [10:25:45] Logged the message, Master [12:24:50] 3Wikimedia Labs / 3Infrastructure: Implement log rotation for jstart - 10https://bugzilla.wikimedia.org/46471#c1 (10Tim Landscheidt) 5NEW>3RESO/DUP *** This bug has been marked as a duplicate of bug 66623 *** [12:24:50] 3Wikimedia Labs / 3tools: Setup an easy to use logrotate based system for rotating tools logs - 10https://bugzilla.wikimedia.org/66623#c1 (10Tim Landscheidt) *** Bug 46471 has been marked as a duplicate of this bug. *** [12:29:19] 3Wikimedia Labs / 3tools: Setup an easy to use logrotate based system for rotating tools logs - 10https://bugzilla.wikimedia.org/66623#c2 (10Tim Landscheidt) I haven't tested it, but I assume SGE only lets go of the file descriptors when the job is not running. So any log rotation would require a job restar... [13:05:24] Coren: around ? :-) [13:16:07] 3Wikimedia Labs / 3deployment-prep (beta): Improper deletion of fiwiki from beta.wmflabs.org - 10https://bugzilla.wikimedia.org/66401#c2 (10Antoine "hashar" Musso) We deleted a bunch of wikis ages ago because beta was not really capable of supporting a hundred of wiki. I guess something in CentralAuth datab... [13:16:20] 3Wikimedia Labs / 3deployment-prep (beta): Beta cluster centralauth accounts points to no more existing wikis - 10https://bugzilla.wikimedia.org/66401 (10Antoine "hashar" Musso) [13:22:35] 3Wikimedia Labs / 3deployment-prep (beta): Vector buttons are not translated - 10https://bugzilla.wikimedia.org/67087#c1 (10Antoine "hashar" Musso) 5NEW>3RESO/FIX I can no more reproduce the issue. This is a month old, I am assuming this was a transient error with the localization cache. [14:07:27] Coren: wanna do the 1/1 hangout now? :-] [14:07:40] about Jenkins / Openstack [14:08:16] hashar: D'oh! I overslept and forgot you. Sure. Gimme 2 mins and I'm all yours. [14:08:32] :-] [14:09:24] I'm in it now. [14:10:21] joining [15:02:57] Coren: Tweetme is still spidering [15:12:33] Coren: merci beaucoup :-] [16:16:49] hi! is it possible that the webservice cannot handle sql SELECTs with to much output? i tested a query with LIMIT and it functioned well, but when i tried without LIMIT i got "400 Bad request" and the webservice doesnt seem to function (even after starting it via putty). could anybody help me? [16:20:57] sanyi4: If you're talking about lonelylinks, the problem with the webservice is that error.log is owned by sanyi4 and cannot be written to by the tool account. "become lonelylinks" and "take error.log" should fix that, followed by "webservice restart". [16:27:16] scfc_de: that doesnt work :( [16:27:51] sanyi4: The webserver for lonelylinks is running again? [16:30:32] scfc_de: well, i got "Restarting webservice....... restarted." for "webservice restart", so i guess it should be running, but http://tools.wmflabs.org/lonelylinks is all white. 
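A minimal, runnable version of the corrected connection discussed above, assuming a Python 2 Tool Labs environment with MySQLdb installed; the host and database names are the ones from the conversation, and the example query is arbitrary:

    import os
    import MySQLdb

    # read_default_file is not tilde-expanded by MySQLdb, so expand it explicitly.
    conn = MySQLdb.connect(
        host='commonswiki.labsdb',
        db='commonswiki_p',
        read_default_file=os.path.expanduser('~/replica.my.cnf'),
    )
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM image")  # any small read-only query will do
    print(cur.fetchone())
    conn.close()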
[16:35:01] sanyi4: If you do "qstat" as lonelylinks, you see that the webservice is running ("r"). But something's indeed not right ... [16:36:09] yuvipanda|zzzzz: If you're awake, "curl -I http://tools.wmflabs.org/lonelylinks" says "HTTP/1.1 502 Bad Gateway" (from nginx), but no text. This should be more explicit. [16:39:22] "curl -H 'Host: tools.wmflabs.org' -I http://tools-webgrid-02:4087/lonelylinks/" gives no content as well. [16:43:02] Ouch! "curl -I" suppresses the content, idiot. [16:43:11] Okay, so the problem is in nginx. Let's see. [16:46:14] sanyi4: So the problem is that your webserver runs at tools-webgrid-02 on port 4087, while the proxy thinks it's running on tools-webgrid-04 on port 4015. Why? I have no idea. I'll try to fix that. [16:48:41] scfc_de: than maybe it was Bad gateway, not Bad request, sry. [16:49:43] sanyi4: Whatever it was, it still is, and is a problem with the proxy. I've just restarted the webservice, but the Redis DB hasn't been updated. [16:50:03] scfc_de: That's really not normal. [16:50:55] Coren: On tools-webproxy, "HGETALL prefix:lonelylinks" => "*0" ("nc -C localhost 6379"). [16:51:34] Coren: Doesn't the webservice only start if the registering of the port succeeded? [16:51:52] It does, which means that redis should have overriten the old key with the new one. [16:53:09] Let me try that manually (the syswrite). [16:56:13] Okay, from tools-webgrid-03 as tools.lonelylinks, "nc tools-webproxy 8282", ".*", RET, "tools-webgrid-03:4051", RET, connection is closed by -webproxy, HGETALL on tools-webproxy still shows empty, so the problem seems to be in the socket => Redis bit on tools-webproxy. [16:57:26] proxylistener, it's called. [16:57:56] And /var/log/proxylistener seems to have stopped 2014-07-21 14:59:03,315. [16:59:06] YuviPanda: Want to take over? Can proxylistener be safely restarted? The only thing that would be missing is that on webservice shutdown of the currently connected webservices the corresponding Redis entries weren't removed? [16:59:20] oh boy [16:59:24] is proxylistener alive? [16:59:33] * YuviPanda sshs in [17:00:13] scfc_de: no, restarting it, I think, will kill all the open sockets it has... [17:00:49] YuviPanda: Yep, but that wouldn't make nginx stop? [17:00:54] *stop working [17:01:11] no, nginx would be fine, just working on the things already in the redis database? [17:01:21] just... everything will be in an inconsistent state [17:01:26] scfc_de: can you try starting some other tool? [17:01:57] YuviPanda: Started "templatecheck". [17:02:41] scfc_de: hmm, right. [17:02:41] YuviPanda: That shouldn't affect running lighttpds, they never actually /read/ from the sockets. What'll happen is that dying webservices will have their entries stay in redis until restarted (i.e.: give 502s instead of 403s) [17:03:00] Which is suboptimal, but not catastrophic. [17:03:02] Coren: hmm, right. so I suppose... I should just restart it and see what happens? :D [17:03:24] YuviPanda: If those webservices are restarted, will they overwrite the existing entry or add another one? (Don't know about Redis data types.) [17:03:26] YuviPanda: I'd like to know /why/ it has gotten stuck first. Do you have any logging in that thing? [17:03:31] !log deployment-prep Testing scap change I40a891b via cherry-pick [17:03:34] Logged the message, Master [17:03:44] Coren: I do, and logging is just stuck from about 2h ago [17:03:50] Coren: /var/log/proxylistener [17:04:48] YuviPanda: Is there still a connection between proxylistener and Redis? Maybe that was closed. 
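The manual registration test described above can be written as a few lines of Python. This is a sketch based purely on the transcript (send the URL pattern ".*", then "host:port", one per line, to proxylistener on tools-webproxy:8282), not the actual portgrabber code:

    import socket

    # Mimics the `nc tools-webproxy 8282` test from the log.
    s = socket.create_connection(('tools-webproxy', 8282))
    s.sendall('.*\n')                      # URL pattern the proxy should route
    s.sendall('tools-webgrid-03:4051\n')   # backend host:port for the webservice
    # In normal operation the registering process keeps this socket open for the
    # lifetime of the webservice; proxylistener is expected to drop the route
    # when the connection goes away.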
[17:04:55] YuviPanda: It's currently doing selects with half-second timeouts. [17:04:56] ah, maybe that [17:05:09] YuviPanda: So it /look/ kinda alive. [17:05:25] right [17:05:47] * Coren restarts a webservice, see if proxylistener reacts. [17:06:28] accept(4, {sa_family=AF_INET, sin_port=htons(35569), sin_addr=inet_addr("10.68.17.123")}, [16]) = 983 clone(child_stack=0x7f0c5ae2cff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f0c5ae2d9d0, tls=0x7f0c5ae2d700, child_tidptr=0x7f0c5ae2d9d0) = 22441 futex(0x27b3190, FUTEX_WAKE_PRIVATE, 1) = 1 [17:06:49] So it sees the connection, but if you have more than one thread the "do actual stuff" thread doesn't seem to do much. [17:07:20] Lemme follow the clone() [17:08:28] YuviPanda: https://tools.wmflabs.org/paste/view/bcd48340 [17:08:40] YuviPanda: Looks like the proxylistener actually tries to do the right thing. [17:09:07] YuviPanda: Look at lines 44 to 49 [17:09:32] For that matter: [17:09:36] 2014-07-21 17:07:22,572 Removed redis key prefix:csbot with key/value .*:tools-webgrid-03:4053 [17:09:37] 2014-07-21 17:07:48,335 Received request from csbot for .* to tools-webgrid-03:4053 [17:09:37] 2014-07-21 17:07:48,344 Set redis key prefix:csbot with key/value .*:tools-webgrid-03:4053 [17:09:41] look at 49, it's even writing a log entry [17:09:44] hmm [17:09:49] Coren: wat, it worked for you and didn't for us? [17:09:52] * YuviPanda is confused [17:10:13] Wait, maybe the *portgranter* is dead on one of the nodes. [17:10:18] aaah, maybe [17:11:09] * Coren checks webgrid-02 [17:12:40] !log deployment-prep Hotfix for scap ssh host key checking to fix jenkins scap job [17:12:43] Logged the message, Master [17:15:12] Looks like something odd going on with portgranter on -02; attempting to grab a port manually doesn't seem to talk to proxylistener [17:15:21] btw, http://tools.wmflabs.org/lonelylinks/ is not blank any more, but theres no webservice on. [17:17:18] Coren: But that was the same on -03; and I even connected manually with nc. [17:17:31] * Coren keeps looking into it. [17:19:23] Oh, duh, nevermind; bigbrother keeps restarting my webservice behind my back. :-) [17:20:28] Works 100% of the time from any webgrid node, as far as I can tell. Lemme try with lonelylinks. [17:20:50] 2014-07-21 17:20:43,501 Received request from lonelylinks for .* to tools-webgrid-02:4087 [17:20:51] 2014-07-21 17:20:43,502 Set redis key prefix:lonelylinks with key/value .*:tools-webgrid-02:4087 [17:21:05] o_O [17:21:10] wat [17:21:22] In the words of the great poet: "Whu?" [17:21:29] Hmmm. [17:21:46] * Coren tries with webservice [17:22:56] tools.lonelylinks@tools-login:~$ webservice start [17:22:56] [on -webproxy] [17:22:56] 2014-07-21 17:22:33,296 Received request from lonelylinks for .* to tools-webgrid-03:4051 [17:22:56] 2014-07-21 17:22:33,297 Set redis key prefix:lonelylinks with key/value .*:tools-webgrid-03:4051 [17:23:33] sanyi4: How, exactly, were you starting your webservice? [17:24:31] When I did, I did just "webservice stop" => wait, till "qstat" empty, "webservice start". Nothing in -webproxy, then. [17:24:34] Coren: I think they where *restarting* a crashed service [17:25:58] scfc_de: That's what *I* did ant it worked. [17:26:21] Betacommand: webservice restart is exactly that dumb anyways; it qdels, wait for it to be gone, then just starts it again. [17:26:57] YuviPanda: Did you restart any component on -webproxy? 
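The same routing check done above with `nc -C localhost 6379` can be done with redis-py; the key layout ("prefix:&lt;tool&gt;" hashes mapping a URL pattern to "host:port") is taken from the log lines quoted in this exchange, and the sketch assumes it is run on tools-webproxy:

    import redis

    r = redis.StrictRedis(host='localhost', port=6379)
    print(r.hgetall('prefix:lonelylinks'))
    # Expect something like {'.*': 'tools-webgrid-02:4087'}; an empty dict means
    # the proxy has no route for the tool, which matches the 502 seen here.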
[17:27:07] Coren: no, didn't tocuh anything [17:27:12] * Coren boggles. [17:27:14] I was going to strace it when you just did that [17:27:15] Coren: webservice restart [17:27:21] The network load on -webproxy can't be that high to block connections, can it? [17:27:24] http://en.wikipedia.beta.wmflabs.org/ hmm [17:27:38] this doesn't look right [17:27:51] scfc_de: That box is mostly idle. [17:28:05] scfc_de: Proxying isn't all that expensive. [17:28:12] Yep. [17:28:15] Coren: not it works. [17:28:39] sanyi4: Yeah, when I was trying to figure out why it failed for you it just didn't. [17:28:42] (fail) [17:29:16] Coren: sry, i mean: now it works [17:29:20] So I guess it means that it failed to fail correctly? :-) [17:32:14] !log deployment-prep Updated scap to 4871208 (+ cherry pick of I6a56b5e) [17:32:16] Logged the message, Master [17:34:56] back to the fundamental problem: the webservice crashed when i queried a very big select - essentially i was trying to copy the wb_items_per_site table from wikidata_p to the local database. how can i do this without knocking out the system again? [17:36:19] sanyi4: Presuming the problem was memory, it was probably because your result set was too large. Often using a cursor will allow you to not have to snarf the whole result at once. [17:36:58] Coren: what do you mean? [17:37:00] (Allowing you to fetch, say, 100 or 1000 rows at a time instead of the whole thing) [17:37:37] sanyi4: I don't know what language or library you are using for DB access, but scan its documentation for cursors and fetching a number of rows at a time. [17:38:25] Coren: i thought about that, but how do i know if im at the end of table? [17:39:18] Coren: i was using php script [17:39:23] sanyi4: Normally, whatever mechanism your library provides for reading rows will return something different when you're at the end. Often, just the number of returned rows will suffice -- say you ask for 1000 rows and get 500 in return normally means you reached the end. [17:41:00] sanyi4: Are you using mysql, mysqli or some other driver? [17:41:06] Coren: Hmm, just restartet webservice newwebtest (this time a custom one). qstat says running ; page says 404 [17:41:29] hedonil: 404 normally means your lighttpd is running and returns the 404 [17:41:38] hedonil: check your error.log [17:41:44] Coren: s/404/no webservice/ [17:43:07] hedonil: Custom one? What is the invocation script you are using? [17:43:28] Coren: $ ./webstart [17:43:34] You might get a 403 because your use of portgrabber isn't quite right; I can tell if I take a peek at it. [17:44:02] * YuviPanda pokes andrewbogott with https://gerrit.wikimedia.org/r/#/c/147759/ [17:44:07] Coren: mysqli [17:44:08] Coren: Hm, I can try it with palin vanilla webservice, if it's the same thing [17:44:19] Coren: but this worked until yesterday [17:46:20] YuviPanda: Why not integrate with role::puppet::self? [17:46:32] scfc_de: because that uses operations/puppet.git [17:46:41] scfc_de: and if you fork that, you need to keep your forks rebased properly [17:46:52] hedonil: At first glance, I see no reason why your custom scripts wouldn't work. Gimme a minute to check the logs. [17:46:52] Coren: plain vanilla webservice on newwebtest now, same result -> no webservice [17:47:11] YuviPanda: Yeah, I mean, pimp up role::puppet::self so that it accepts custom Git URLs. [17:47:45] (Yes, very non-trivial, just want to know if there are technical obstacles.) [17:47:47] Heyyyyy... waitaminit. 
[17:48:28] Coren: You can start / stop both as you like it [17:49:11] scfc_de: this is easier? plus this means I don't have to repeat essential services, and this also means that there will be a puppet running there that'll put in changes to the labs base files, etc [17:49:52] hedonil: That looks like sanyi4's problem I didn't manage to reproduce. Yeay, intermittent issue! [17:50:03] Yeah [17:50:08] YuviPanda: k [17:50:18] oh wow [17:50:27] Oh, and ffs, ops meeting in 10 minutes. Debug under time pressure FTW [17:50:40] heh [17:51:33] proxylistener's VSZ is 5200620 = 5+ GByte? [17:52:30] Could it simply run OOM for new connections? [17:52:48] YuviPanda: Yeah, it's definitely proxylistener that is ill. [17:52:54] andrewbogott: hi, so i did more lookups and testing on the video chat things, and found a good solution [17:53:01] YuviPanda: Talking to it by hand has it intermitently fail now. [17:53:03] memory leak? [17:53:05] hmm [17:53:17] matanya: which one? [17:53:29] a desktop app called utox [17:53:30] Or the Python overhead per connection is bigger than expected? [17:53:58] Coren: you meant something like that: [17:53:59] i was working on a server solution as well, and got it to work on one-->one audio/video [17:54:14] based on jappix and turn [17:54:47] i hope the multiple-->multiple support will be released soon [17:54:57] Coren: $j=0; $k=0; while ($k==0) { $sql="SELECT * FROM wb_items_per_site LIMIT ".$j.", ".$j+1000; $query=mysqli_query($wikidata,$sql) or die(" Error: ".mysqli_error($wikidata)); $j+=1000; $k=1000; while ($row=mysqli_fetch_array($query)) { $k--; $sql2="INSERT INTO wb_items_per_site VALUES(".$row['ips_row_id'].",".$row['ips_item_id'].",'".$row['ips_site_id']."','".str_replace('\'','\'\'',$row['ips_site_page']) [17:56:12] sanyi4: That's a little more complicated than you needed if you used cursor (you'd have one select rather than several getting part of the results) but the end result is similar. [17:56:14] matanya: Our main use case involves ~12 clients. It's always hard to predict how things will scale [17:56:29] Coren: scfc_de what do we do now? [17:56:32] matanya: but, getting something to work at all is a big step! [17:56:33] andrewbogott: would you want to test utox with me ? [17:56:46] matanya: sure, but I have a meeting atm [17:56:48] YuviPanda: Ima restart proxylistener. The failiure mode is acceptable. [17:56:49] restart it by hand, and see if that fixes things, and if it does, then take a profiler to it and see if it is leaking memory? [17:56:54] Coren: +1 [17:57:02] +1 [17:57:11] hedonil: Try again? [17:57:13] andrewbogott: at your spare time. poke me whenever [17:57:21] yep! [17:57:36] * hedonil tries [17:57:46] Coren: ok, i will read about corsors now :) thanks a lot. [17:58:16] YuviPanda: The only real problem with restarting proxylistener is that it looses track of webservices so dead ones won't get 403s until they are restarted. [17:58:24] right [17:58:43] Since the actual association is in redis, those will survive. [17:58:46] YuviPanda: Does it overwrite old entries or add sub-entries? I remember something about it round-robbing? [17:59:26] scfc_de: it should override. [17:59:34] scfc_de: subentries were for domainproxy, and it was never exposed [17:59:40] Custom webservice running (qstat) -> no webservice [17:59:59] hedonil: proxylistener failed to restart. Hang tight. [18:00:21] ugh [18:00:23] YuviPanda: It fails to start but doesn't log why. 
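The batching Coren describes above (ask for a bounded number of rows, stop when fewer come back) looks roughly like this. The sketch uses Python/MySQLdb to match the earlier examples, though the same pattern applies to PHP/mysqli; the table and column names come from the conversation, while the host/database aliases are assumptions that may need adjusting:

    import os
    import MySQLdb
    import MySQLdb.cursors

    conn = MySQLdb.connect(
        host='wikidatawiki.labsdb',
        db='wikidatawiki_p',
        read_default_file=os.path.expanduser('~/replica.my.cnf'),
        cursorclass=MySQLdb.cursors.SSCursor,  # server-side cursor: no huge client buffer
    )
    cur = conn.cursor()
    cur.execute("SELECT ips_row_id, ips_item_id, ips_site_id, ips_site_page "
                "FROM wb_items_per_site")
    while True:
        rows = cur.fetchmany(1000)   # bounded batch instead of the whole table
        if not rows:
            break
        for row in rows:
            pass                     # e.g. INSERT the row into the local table here
    cur.close()
    conn.close()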
[18:01:07] Coren: addr already in use [18:01:16] it logs to stdout, let me add a patch to make exceptions go to the log [18:01:19] YuviPanda: Bleh. Didn't turn SO_REUSEADDR on? [18:01:24] no [18:01:25] sadly [18:01:35] tsk, tsk. [18:02:06] hedonil: This time, it held. Plz to try again? [18:02:15] * hedonil tries again [18:02:34] YuviPanda: The restart worked now; I expect the last lingering sockets died. [18:02:43] yeah [18:02:48] I'll fix the logging and REUSEADDR [18:03:59] YuviPanda: Re-read the code and it looks as if proxylistener would indeed /replace/ the old values for the same destination which should be a constant ".*". [18:04:10] (So everything's fine.) [18:04:11] hedonil: I'm getting 500s now, apparently from your lighty. [18:04:12] indeed, it should replace them [18:04:13] yeah [18:04:19] Coren: yep: http://tools.wmflabs.org/newwebtest/perl.fcgi [18:04:19] the urlproxy never had round robin functionality [18:04:49] * Coren rushes to his meeting [18:08:21] hey Coren / andrewbogott - either of you around for a bit of debugging? (instance access issues) [18:08:41] JohnLewis: We're in a meeting atm; maybe in ~30m [18:09:00] Coren: Oh I forgot about that - yeah sure. Poke me when you're available :) [18:23:32] Coren: resolved with some brute 'how about this?' from greg-g :) [18:23:50] Coren: proxycommand fix :) [18:23:58] which I forgot :p [18:27:01] Diff between Redis and grid: diff -u <(ssh tools-login.wmflabs.org 'qstat -q \*webgrid\* -u \* -s r -xml' | sed -ne 's/^ tools\.\(.*\)<\/JB_owner>$/\1/p;' | sort) <(ssh tools-webproxy.eqiad.wmflabs "echo KEYS prefix:* | nc -Cq 3 localhost 6379" | tr -d \\r | sed -ne 's/^prefix://p;' | sort) [18:28:18] Wow, bookmanagerv2 has seven lighttpds running. [18:28:29] whu? [18:30:57] have a look at http://tools.wmflabs.org/lonelylinks/test_query.php what am i getting wrong with cursors? [18:32:47] Coren: I assume the second one was started manually, and subsequent calls of webservice got a hiccup of two jobs with that name existing. I'll stop all and start one again. [18:33:09] * YuviPanda pokes Coren and andrewbogott with https://gerrit.wikimedia.org/r/#/c/147759/ [18:33:46] !log tools Stopped eight (!) webservices of tools.bookmanagerv2 and started one again [18:33:49] Logged the message, Master [18:38:53] !log tools Restarted webservice for stewardbots because it wasn't in Redis [18:38:55] Logged the message, Master [18:39:32] !log tools Removed stale Redis entries for currentevents, misc2svg, osm4wiki, wp-signpost, wscredits and yadfa [18:39:34] Logged the message, Master [18:39:49] sanyi4: the thing Coren was mentioning, is fetching a resultset row by row ( mysqli::use_result() will fetch the rows one by one. ) [18:40:33] sanyi4: but if you experience promblems with resultset and OOM you should read this: [18:40:35] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help/Performance [18:42:34] hedonil: That's a very nice, very useful document. Is it linked from the main help document? [18:43:08] Coren: not yet. [18:46:59] Coren: It's now time to bring the custom lighty's back to standard and join bigbrother ;) [18:47:18] Coren: but this'll need some (minor) modifications to https://git.wikimedia.org/blob/operations%2Fpuppet/aaec26d3ee382667a9ce2d8820c60ef6895f07cf/modules%2Ftoollabs%2Ffiles%2Flighttpd-starter [18:48:09] hedonil: Do tell. [18:48:19] Coren: what is your thought about requiring UA's to connect to labs? [18:48:20] .. 
and some additions to bigbrother, taking timeouts (hanging services) into account [18:48:35] Coren: is bigbrother on puppet? [18:48:35] * hedonil writes & explains [18:48:57] YuviPanda: Yes. [18:49:01] cool [18:49:04] YuviPanda: Yeah, it's in modules/toollabs/files [18:49:05] where? :D [18:49:09] ah, wasn't looking there. [18:49:10] cool [18:49:17] ugh, /me hits self on forehead [18:49:21] Coren: and tweetme bot is still crawling [18:49:36] Coren: scfc_de btw, it's public news that I'm moving to the ops team in a few months :) [18:49:44] Betacommand: Hm. More IP ranges not previously being used I guess. I'll take a looksie through the logs shortly. [18:50:11] YuviPanda: Yeah, I had a discussion about that w/ Mark. You're being earmarked for labs even. :-P [18:50:22] Coren: Betacommand it should be not too hard to add UA blocking to the proxy, and we can manage the list of UAs in puppet [18:50:24] easily [18:50:26] Coren: :D yeah [18:50:33] YuviPanda: Yippie! :-) [18:50:43] YuviPanda: That'll teach ya to be useful! [18:50:43] Coren: who would've thought :) [18:50:46] heh [18:51:27] YuviPanda: I'm not complaining, with CI likely to need support and labs resources the extra help will be needed. [18:51:35] yup [18:52:10] Betacommand: You mean block connections without UAs? I'm okay with that; that's general policy for prod too. [18:52:22] Coren: I can focus on tools + labs projects themselves, at least to star twith [18:52:54] Coren: tracking ticket: https://bugzilla.wikimedia.org/show_bug.cgi?id=68300 [18:52:54] brb, lunch [18:53:55] Coren: yeah [18:54:42] Coren: Im a PITA and actually review my access.log :P [18:55:42] YuviPanda: Ill hold your toes to the fire and make sure you do that UA stuff :P [18:55:53] Betacommand: heh :) [18:56:09] Betacommand: let me block things without UA [19:22:44] andrewbogott: what is "subject area narrow or broad" in the project documentation form supposed to mean? [19:37:23] andrewbogott: Coren can you add MatmaRex to bastion project? [19:37:28] * MatmaRex waves [19:37:30] apparently he isn't in there, despite having had tools access.. [19:38:38] YuviPanda: looking... [19:39:30] gifti: it's supposed to mean whether the project is for a single thing or for a bunch of things (e.g. the 'testlabs' project which ops use for any random test node) [19:39:38] But, the name is definitely unclear... [19:39:39] ah, thx [19:42:54] MatmaRex: done [19:43:34] thanks! [19:43:37] (works) [19:48:32] andrewbogott: https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProxy&action=create&project=design®ion=eqiad is giving me a blank page again [19:48:33] oom? [19:51:23] YuviPanda: seems specific to that project… I don't know what's happening yet [19:51:34] hmm bleh [19:55:22] I think it's because an instance is in an in-between state; it shows up in the list but doesn't have a 'host' entry [20:00:48] YuviPanda: better? [20:03:52] andrewbogott: yeah [20:07:32] andrewbogott: ty [20:07:39] andrewbogott: think you'll have time to look at puppetception today? [20:07:48] Yeah, I should [20:08:58] andrewbogott: cool :) [20:09:15] andrewbogott: once it gets merged, I'll add variables, etc with a role that hsould show up i n the wikitech project [20:21:25] andrewbogott: convention over configuration :) [20:21:59] YuviPanda: well, there are two questions here... [20:22:08] Will any random 3rd party repo use a site.pp? Probably. [20:22:15] But, what will define the current node in that site.pp? [20:22:21] node definitions? [20:22:27] same as what we have in prod? 
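For reference, the "addr already in use" restart failure discussed earlier in this exchange is the usual symptom of binding a listening socket without SO_REUSEADDR while old connections linger. A minimal sketch of the fix, using the proxylistener port from the log:

    import socket

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # allow immediate rebinding
    listener.bind(('0.0.0.0', 8282))
    listener.listen(128)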
[20:22:43] So you're assuming that the user will have a local patch to site.pp that names the labs instance? [20:22:48] That's ok, I guess [20:23:24] andrewbogott: assume I start a project called 'paprika', and I'll have a paprika-puppet git repo that will have puppet config for my project. In that, I have a site.pp that has node definitions for the ones I want [20:23:29] so no need to have a local patch [20:24:06] That's right, assuming that paprika-puppet is /your/ repo and not from a 3rd party [20:24:12] indeed [20:24:21] if it's a random 3rd party repo, I'll have to fork it and modify it for my needs [20:24:37] that's the underlying assumption, yeah. no support for any random 3rd party puppet repo [20:26:32] andrewbogott: w000t [20:27:20] andrewbogott: is there docs on how to get text boxes to appear on wikitech configure instance page? [20:28:02] Probably, but it's easy, https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup [20:28:28] ah cool [20:28:35] andrewbogott: so I just write a role, and have global variables? [20:28:41] and then just add them to the page [20:28:43] right [20:28:47] cool [20:44:35] !log integration created integration-slave1004-trusty a trusty instance unsurprisingly [20:44:37] Logged the message, Master [20:44:38] Krinkle: ^^ :D [20:54:43] sometime not performing? [20:54:45] 2014-07-21 20:53:39: (mod_fastcgi.c.3001) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 79 [21:08:14] !log integration Switching integration-slave1004-trusty to its own puppetmaster [21:08:17] Logged the message, Master [21:18:44] !log deployment-prep Restarting upd2log-mw on deployment-bastion. There is a bunch of [python] processes [21:18:47] Logged the message, Master [21:19:21] akoopal: this is quite normal for your current configuration [21:19:48] akoopal: btw. your tools is getting pretty spidered UA "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1; 360Spider" [21:21:01] !log deployment-prep Running update.php for commonswiki in screen [21:21:03] Logged the message, Master [21:21:08] !log deployment-prep Running update.php for cawiki in screen [21:21:10] Logged the message, Master [21:21:13] !log deployment-prep Running update.php for eswiki in screen [21:21:15] Logged the message, Master [21:21:49] 3Wikimedia Labs / 3tools: [tracking] Block spider / web crawler on tool labs - 10https://bugzilla.wikimedia.org/68300#c2 (10metatron) - 360Spider User Agent: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1; 360Spider" [21:21:53] hedonil: hmmm [21:22:13] !log deployment-prep Running update.php for hewiki in screen [21:22:15] Logged the message, Master [21:23:26] !log deployment-prep Running update.php for simplewiki in screen [21:23:28] Logged the message, Master [21:25:12] ohh [21:25:16] it is failing for all of them ? [21:25:22] bd808: ^^ [21:25:50] hashar: Yeah. I have 5 updates running in screen so far [21:25:51] !log etherpad clearning up and deleting project [21:25:52] Logged the message, dummy [21:26:02] bd808: kill them, lets change the jenkins job timeout [21:26:10] I though it was only a couple wikis failling [21:26:24] hashar: Ok. 
I'll kill mine [21:26:38] let me share my screen in the hangout hehe [21:27:46] !log deployment-prep Killed update.php jobs; Antoine will give jobs a longer timeout [21:27:48] Logged the message, Master [21:28:46] bd808: yeah i just manually removed the timeout from the job [21:28:51] using the Jenkins graphical interface [21:28:54] * bd808 nods [21:29:06] someone suggested that in one channel [21:29:09] that is smarter :D [21:29:31] andrewbogott: I filed a bug about Extension:OAuth on WikiTech... [21:29:53] thanks bryan [21:30:20] YuviPanda: I think it is installed, though? [21:30:27] andrewbogott: no;pe [21:30:38] I… see it in extensions/OAuth [21:30:39] on virt1000 [21:30:43] https://wikitech.wikimedia.org/wiki/Special:Version [21:30:46] not there [21:31:19] andrewbogott: maybe not included in LocalSettings? [21:31:39] yeah, that's probably right [21:31:45] YuviPanda: [21:31:51] right. can you add it? :D [21:32:01] If you add it to the wikitech vagrant role, then… that will provide me with a nice example [21:32:02] :) [21:32:20] andrewbogott: hehe, adding it to vagrant will be a totally different process, since I'll just setup a requires => :) [21:32:29] true :( [21:32:32] OK, I'll have a look [21:32:38] andrewbogott: btw, not sure if you saw - my ops move is confirmed (and public) :) [21:32:45] +17 to oauth on wikitech and a way to use in via apache configs [21:32:58] YuviPanda: yeah, I did -- good news! [21:33:09] adding OAuth probably requires a deployment window &c. [21:33:13] But I'll see what's involved [21:33:31] andrewbogott: cool. I'm developing something that'll require that [21:33:40] (https://meta.wikimedia.org/wiki/Research:Ideas/Public_query_interface_for_Labs) [21:34:22] YuviPanda: https://bugzilla.wikimedia.org/show_bug.cgi?id=61754 Ryan mentioned setting up openid in the way long ago [21:34:29] akoopal: you may add the following lines to your .lighhtpd.conf & restart webservice to ban evil spiders [21:34:32] # deny access for baidu, Yahoo, yandex, tweetmemebot [21:34:33] $HTTP["useragent"] =~ "(Baiduspider|Yahoo! Slurp|yandex\.com/bots|TweetmemeBot|CCBot|scrapy\.org|Sogou web spider|SeznamBot|SputnikBot|kinshoobot|YisouSpider|360Spider)" { [21:34:33] url.access-deny = ( "" ) [21:34:33] } [21:34:44] woo [21:34:47] bd808: heh, yeah. but I guess that's a while off [21:35:47] bd808: puppetception got merged, btw :) [21:35:57] I saw that. Pretty cool [21:37:15] hedonil: I assume I can just create that file. Done [21:37:15] akoopal: better for pasting : https://tools.wmflabs.org/paste/view/9ebec5b1 [21:38:21] hmm, one space missed out :-) [21:38:26] akoopal: yep. if it doesn't exist, just create .lighttpd.conf <- right spelling &paste the lines [21:38:34] this still shows up in logs? [21:38:59] gifti: yes it shows up in logs but as 403 forbidden [21:39:15] so it doesn't eat up much resources [21:42:58] hedonil: thanks for the hint, and now bedtime, night :-) [21:43:08] if you have a browser addon installed like "User Agent Switcher", you can easily check if it works [21:43:15] akoopal: yw [21:44:33] andrewbogott: btw, there's a bug for it at https://bugzilla.wikimedia.org/show_bug.cgi?id=68305 [21:45:00] scfc_de: have you seen https://meta.wikimedia.org/wiki/Research:Ideas/Public_query_interface_for_Labs [21:45:21] andrewbogott: found 5 minutes for me ? [21:45:47] matanya: sure [21:47:49] YuviPanda: Yep! Interesting idea; might also hold off those who otherwise would install phpmyadmin world-accessible :-). 
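The .lighttpd.conf user-agent ban above can also be verified without a browser addon; a small sketch with the requests library (the tool URL here is only an example):

    import requests

    url = 'http://tools.wmflabs.org/lonelylinks/'
    banned = requests.get(url, headers={'User-Agent': '360Spider'})
    normal = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (sanity check)'})
    print(banned.status_code, normal.status_code)   # expect 403 for the banned UA, 200 otherwise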
[21:47:58] scfc_de: :) [21:48:04] scfc_de: yeah [21:48:15] scfc_de: I've a UI only prototype running at http://yuvi.in:5000/ [21:48:24] (Login doesn't work yet since wikitech has ano OAuth) [21:49:17] scfc_de: need to find a proper name :) [21:50:13] "Because SQL doesn't need SSH" :-). [21:50:53] scfc_de: heh :D a bit long for a name [21:51:51] YuviPanda: SQLExplorer [21:52:03] hedonil: too generic [21:52:23] YuviPanda: Yuvi's fancy SQL browser [21:53:10] SQLEverywhere [21:53:25] heh [21:53:30] it's not browsing anything [21:54:06] SQL on-the-fly [21:54:18] heh [21:54:26] or The one that must not be named [21:55:24] YuviPanda: adn don't forget a query killer for that [21:55:35] hedonil: indeed, if you look at the meta page, an aggressive query killer is a must have [21:55:39] hedonil: don't want to DDoS LabsDB [21:56:04] hehe, there are so many others yet ;-) [21:57:13] hedonil: :D Current plan is to 1. cap total number of queries that can be run through the service 2. per user caps 3. max time limit aggressive (10m? lesser?) 4. logging [22:00:28] I think the restriction that users must have a wikitech account will limit DDoS quite a bit. [22:00:49] scfc_de: indeed. [22:01:06] YuviPanda: discussing data4all? :p [22:01:06] who will use it then? [22:01:47] gifti: people who find it easy enough to sign up for wikitech, but not to setup public/private keys and ssh tunnesl? [22:01:50] gifti: primarily academia [22:01:55] and casual researchers [22:01:57] ah [22:02:18] but they speak sql? [22:02:29] indeed [22:02:43] a lot of them write a lot of code to read from the dumps instead of just using labsdb [22:03:30] JohnLewis: it's going to be called 'Quarry' :) [22:03:52] I prefer data4all though :( but fair enough [22:04:16] heh [22:04:30] YuviPanda: about? [22:04:44] chasemp: https://meta.wikimedia.org/wiki/Research:Ideas/Public_query_interface_for_Labs [22:05:15] no clue why you linked me that :) [22:05:34] chasemp: oh, damn [22:05:36] chasemp: yes, I'm here [22:05:41] chasemp: I thought you asked me what we were talking about :D [22:05:54] heh, nope, what was the bug where we resolved the diamond logs crap? [22:06:06] ppl finding orphaned logs still wanting to point them to it [22:06:18] chasemp: ah, unsure. scfc_de do you have that bug handy? [22:08:19] chasemp: https://bugzilla.wikimedia.org/show_bug.cgi?id=66458 [22:18:18] Coren: can we get the graphite host to be trusty from the start? [22:18:27] since everything else is moving slowly, might as well start from start... [22:18:38] I'll test on a labs instance and make sure it works properly [22:19:39] andrewbogott: can I get you to create a couple of repos for me? :) [22:20:18] YuviPanda: I'll try, although I traditionally screw that up [22:20:34] andrewbogott: heh :D analytics/quarry/web, analytics/quarry/puppet for now [22:20:58] YuviPanda: thanks [22:21:01] chasemp: yw [22:21:33] YuviPanda: rights inherit from? [22:21:39] And, do you want an initial commit or are you importing? [22:21:56] andrewbogott: I'll import for /web, would be nice to have an initial commit for puppet [22:22:07] andrewbogott: rights inherit from, hmm, the mediawiki/core group? [22:23:07] YuviPanda: I also need descriptions [22:23:31] andrewbogott: ah, 'Web frontend for the Labs Quarry tool', and 'Puppet config for the Labs Quarry tool' [22:23:51] Query? [22:23:56] Quarry? [22:24:00] andrewbogott: Quarry, yeah [22:24:19] ok, the web tool is created, give it a try? 
[22:24:23] s/tool/repo/ [22:24:49] doing [22:25:30] * YuviPanda tries pushing [22:25:46] andrewbogott: woot, done :D [22:26:57] andrewbogott: create the puppet one? [22:26:58] ok -- the other one is done now too [22:27:18] w00t [22:27:19] andrewbogott: ty [22:38:02] YuviPanda: Yeah, Trusty indeed. [22:38:20] Coren: cool. I'll leave you as well to decide the disk layout [22:42:07] 3Wikimedia Labs / 3deployment-prep (beta): The current db schema change upgrade is taking far too long - 10https://bugzilla.wikimedia.org/68349 (10Greg Grossmeier) 3NEW p:3Unprio s:3critic a:3None See https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/2741/console Ones co... [22:42:20] 3Wikimedia Labs / 3deployment-prep (beta): The current db schema change upgrade is taking far too long - 10https://bugzilla.wikimedia.org/68349 (10Greg Grossmeier) p:5Unprio>3High [22:43:44] andrewbogott: can I also get a labs project named quarry? [22:46:50] 3Wikimedia Labs / 3deployment-prep (beta): The current db schema change upgrade is taking far too long - 10https://bugzilla.wikimedia.org/68349#c1 (10Bawolff (Brian Wolff)) To clarify is this every time or just a specific update? If the schema is already up to date, it should finish within seconds (especial... [22:49:20] 3Wikimedia Labs / 3deployment-prep (beta): The current db schema change upgrade is taking far too long - 10https://bugzilla.wikimedia.org/68349#c2 (10John F. Lewis) The previous job was aborted by hashar and prior to that it failed on enwiki. So the past 2 runs failed for reference. [22:51:09] 3Wikimedia Labs / 3deployment-prep (beta): The current db schema change upgrade is taking far too long - 10https://bugzilla.wikimedia.org/68349#c3 (10Greg Grossmeier) To be explicit: this causes browser tests to fail because the database is in read-only mode (ie: no edits can be made). [22:52:20] 3Wikimedia Labs / 3deployment-prep (beta): The current db schema change upgrade is taking far too long - 10https://bugzilla.wikimedia.org/68349#c4 (10Greg Grossmeier) (In reply to Bawolff (Brian Wolff) from comment #1) > To clarify is this every time or just a specific update? > > If the schema is already u... [22:55:37] 3Wikimedia Labs / 3deployment-prep (beta): The current db schema change upgrade is taking far too long - 10https://bugzilla.wikimedia.org/68349#c5 (10Bawolff (Brian Wolff)) For reference, according to https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/label=deployment-bastion-eqiad,wikidb=e... [22:56:53] YuviPanda: https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry <- please click on the 'Edit Documentation' link :) [22:57:01] Um… Add [22:57:50] andrewbogott: cool [22:59:34] YuviPanda: Note for future reference: when we add grid nodes we have to remember to give them public IPs so that identd work and IRC doesn't hate bots. [22:59:42] Coren: aaah, cool [22:59:53] We forgot to do so when adding the new batch. [23:00:14] could that be automated? [23:00:31] andrewbogott: rebooting after a stupid mistake, let me do that when it comes back [23:01:08] 3Wikimedia Labs / 3deployment-prep (beta): The current db schema change upgrade is taking far too long - 10https://bugzilla.wikimedia.org/68349#c6 (10Bawolff (Brian Wolff)) > > We can't have the Beta Cluser throwing database locked errors for the entire > day. At first glance, I don't see any reason why t... [23:01:28] gifti: Not really; it's just a step to remember when creating the instance. 
[23:01:35] well [23:05:21] btw for those who care in here: The Beta Cluser not only has it's database locked due to a schema upgrade, but! ori will be attempting to switch it to hhvm right now. [23:05:39] greg-g: don't forget precise -> trusty [23:05:51] ;) [23:05:52] oh yeah, that too [23:05:54] :P [23:05:58] What could go wrong?tm [23:06:03] Erm. [23:06:03] Coren: beat me to it [23:06:06] greg-g: so were getting a broken cluster and breaking it more? sweet :p [23:06:08] ™ [23:06:10] yep [23:06:32] since it isn't useful anyways, let's push through to get us to a better state sooner [23:06:37] is my reasoning [23:06:37] Coren: That's ironic :p [23:06:46] bd808: i merged a conf change to wmf-config 10 mins ago -- how long will it take to reach the beta cluster? [23:07:20] ori: a while. the code update job hasn't ran since the db job started [23:07:38] and that job's been going on a good hour and a half-ish now [23:07:42] oh, i see [23:07:48] any estimate as to how much longer it'll take? [23:07:54] ori: Jenkins is backed up for beta -- https://integration.wikimedia.org/ci/ [23:08:23] lots of non-voting jobs to run [23:08:37] ori: anywhere between 1 second to a good few hours? depends if we let the job run or what [23:08:49] * ori nods. [23:08:53] greg-g: I think I'll just wait, then. [23:09:37] ori: The schema change that you merged for Aaron this morning is taking forever to apply. [23:09:37] The config change that I'm waiting for won't actually flip the switch; it just removes various HHVM-specific workarounds [23:09:37] bd808: I won't merge changes again [23:09:37] oh, bah [23:09:37] :( [23:09:59] ori: you can skip jenkins then update things on deployment-bastion directly. It's pretty much just like updating tin. [23:11:41] poor beta https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/ [23:11:47] bd808: ah, cool [23:11:52] https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/load-statistics [23:11:53] ori: Instead of running `scap` you run `wmf-beta-scap` and you will need to `sudo -u mwdeploy` before doing the git pulls [23:14:08] ori: Or maybe even easier -- `sudo -H -u mwdeploy /usr/local/bin/wmf-beta-autoupdate.py --verbose` to update the MW checkout [23:25:26] bd808: very useful! [23:25:28] thanks [23:26:40] Somebody should make a page on wikitech that describes all the manual steps to replicate what Jenkins does for beta... [23:26:47] ...not it [23:27:35] 3Wikimedia Labs / 3deployment-prep (beta): The current db schema change upgrade is taking far too long - 10https://bugzilla.wikimedia.org/68349#c7 (10Bawolff (Brian Wolff)) (In reply to Greg Grossmeier from comment #3) > To be explicit: this causes browser tests to fail because the database is in > read-only... [23:32:24] greg-g: OK, going for it. Bryan's tips worked. [23:34:44] mwalker: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: $service_ips["ocg"] is :undef, not a hash or array at /etc/puppet/modules/lvs/manifests/configuration.pp:843 on node i-00000083.eqiad.wmflabs [23:35:15] ack; thats an unexpected production bug [23:35:20] that's in labs [23:36:05] i committed a local revert for beta for now [23:36:18] *nods* that's probably the best immediate fix [23:36:27] I'll have to replicate what gage did in production for labs [23:37:00] though... there might be something else going on; because we have other services in production that are not in labs that have lvs entries [23:37:34] and hmm... 
ocg instances do exist in labs, so the fact that the variable is not being populated is interesting [23:37:49] !log deployment-prep Switched over beta cluster app servers to HHVM [23:37:51] Logged the message, Master [23:37:55] ^ greg-g [23:37:56] http://en.wikipedia.beta.wmflabs.org/wiki/Special:Version [23:38:32] bbiaf [23:38:38] I can send an e-mail about it, greg-g [23:39:32] ori: Coooool!