[08:41:33] (PS1) Nemo bis: Move LiquidThreads back to main channel [labs/tools/pywikibugs] - https://gerrit.wikimedia.org/r/142991
[08:42:25] (CR) Yuvipanda: [C: 2 V: 2] "LQT has maintainers? ;)" [labs/tools/pywikibugs] - https://gerrit.wikimedia.org/r/142991 (owner: Nemo bis)
[08:45:04] (PS1) Nemo bis: Restrict ContentTranslation to #mediawiki-i18n [labs/tools/pywikibugs] - https://gerrit.wikimedia.org/r/142992
[08:53:43] (PS2) Nemo bis: Restrict ContentTranslation to #mediawiki-i18n [labs/tools/pywikibugs] - https://gerrit.wikimedia.org/r/142992
[09:05:40] YuviPanda: now it merges :)
[10:43:35] hello, anyone there to answer a puppet question on labs?
[10:44:43] nuria@nuria-worker1:~$ puppet agent -tv
[10:44:43] I get an error like 'Could not request certificate: getaddrinfo: Name or service not known' when trying to run puppet agent
[11:09:44] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap de was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054436 edit summary: /* Zeitplan */ small updates
[11:13:30] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap en was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054438 edit summary: /* Schedule */ small updates
[12:17:09] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap en was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054459 edit summary: /* Schedule */
[12:17:40] Cyberpower678, John F. Lewis, and Quentinv57: sulinfo is down (webservice crashed yesterday for many users, unrelated issue, but needs manual start)
[12:17:45] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap en was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054460 edit summary: /* Schedule */
[12:19:00] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap en was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054461 edit summary: /* Schedule */
[12:21:00] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap de was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054468 edit summary: /* Zeitplan */
[12:21:27] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap de was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054470 edit summary: /* Zeitplan */
[13:09:32] Wikimedia Labs: Mail notifications from fab.wmflabs.org delivered only days later (or not at all?) - https://bugzilla.wikimedia.org/65861#c18 (Marc A. Pelletier) REOP>RESO/FIX Same logfile. I've nuked it.
[14:15:29] * enhydra slowpoked over all the migration brouhaha -_-
[14:15:52] will it be possible to edit redirects after today?
[14:16:10] (I mean my .htaccess on Toolserver)
[14:19:05] Wikimedia Labs / wikitech-interface: Enable HSTS (HTTP Strict Transport Security) on Wikitech - https://bugzilla.wikimedia.org/67303 (fn84b) UNCO p:Unprio s:normal a:None Wikitech requires HTTPS connections (http://wikitech.wikimedia.org redirects to https://wikitech.wikimedia.org), so coul...
[14:57:49] Aaaauugh!
[14:58:00] * Coren beats up mysql.
[14:59:34] enhydra: just mail tsadmins
[15:06:44] enhydra: Yes, valhallasw is right - nosy and amette will create missing redirects from tomorrow as you won't be able to log in anymore.
[15:07:05] nice, thank you
[15:13:53] R.I.P Toolserver and all the tools that did not make the cut :'(
[15:16:06] Coren: hey
[15:16:49] YuviPanda: Heyo. I'm wrangling tables atm, but I'll be ready to do that node addition in ~30m. Okay with you?
[15:17:16] Coren: yeah :)
[15:33:55] /var/run/lighttpd/audetools.conf~: Permission denied
[15:33:58] /var/run/lighttpd/audetools.conf~: Permission denied
[15:34:08] bah /var/run/lighttpd/audetools.conf~: Permission denied
[15:34:18] gah, scroll
[15:41:06] webservice status Your webservice is scheduled: queue instance "webgrid-lighttpd@tools-webgrid-02.eqiad.wmflabs" dropped because it is temporarily not available
[15:41:24] Does it mean that I have to wait for the webservice to start? If yes, is there an estimate?
[15:44:23] Guest57626: It means that it couldn't start on tools-webgrid-02 at that moment, though it might start on -01 very quickly.
[15:44:38] andrewbogott: Do you think we could work on https://bugzilla.wikimedia.org/show_bug.cgi?id=65591 sometime this week? I think all the hard parts would belong to me (fixing beta scap if the shell change breaks it).
[15:44:57] But yeah, I'm about to add a new node to have more room for more services -- the ts going down has given us a big uptake over the past couple days.
[15:45:22] Speaking of, YuviPanda, you want to be the one doing it while I guide you?
[15:45:46] bd808: I'm happy to help, but don't immediately understand… (I don't know much about how deployment works)
[15:46:39] andrewbogott: Mostly I'd like the labs ldap for the mwdeploy user to match up with the prod settings from puppet.
[15:47:00] Ah, so just changing the 'shell' field in the ldap entry?
[15:47:11] yes.
[15:47:49] The possible breakage would be that I used the mwdeploy user to do ssh connections between hosts in beta, so I may have to tweak my ssh automation script
[15:49:00] Coren: yea
[15:49:17] bd808: do you want me to go ahead and change it, and then you can see what breaks?
[15:49:46] andrewbogott: Sure. I have time this morning to deal with the possible fallout
[15:50:53] so, in ldap, user mwdeploy currently has loginShell: /usr/local/bin/sillyshell
[15:51:02] which is different from what the bug says...
[15:52:10] Huh. Setting the shell to /bin/bash made puppet happy. I wonder if there are local settings on the beta hosts?
[15:52:18] I'll look quickly.
[15:53:08] Coren: in the meantime, +2 https://gerrit.wikimedia.org/r/#/c/142208/? :) is already on the general proxy
[15:54:18] andrewbogott: There is no local /etc/passwd entry for mwdeploy on deployment-bastion or deployment-apache01. I don't know why puppet would be happy with shell=>/bin/bash if that doesn't match ldap.
[15:56:54] bd808: are we talking about prod or beta now? In prod I don't think ldap is much involved.
[15:57:23] andrewbogott: I'm talking about beta.
[15:59:21] bd808: hm, I'm not totally sure what happens if puppet tries to create a user who already exists in ldap.
[16:00:06] andrewbogott: It seems to check the info from ldap and if it matches it's happy to move on. We have done this for several other users and groups.
[16:00:41] If the info doesn't match then puppet will error on the user resource
[16:01:59] bd808: ok, but, to reconfirm… you want me to switch it to 'false' in ldap, not to bash, right?
[16:03:33] andrewbogott: /bin/false please, yes.
[16:04:08] That will make it match the define in mediawiki::users
[16:04:17] ok! Stay tuned...
[16:05:01] bd808: ok, done
[16:05:32] Wikimedia Labs / deployment-prep (beta): mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://bugzilla.wikimedia.org/65591#c3 (Andrew Bogott) In ldap, user mwdeploy now has loginShell: /bin/false
[16:05:48] * bd808 checks to see if puppet is appropriately sad now
[16:08:23] YuviPanda: you were working on monitoring for tool labs at some point, right?
[16:08:27] Hm.. webservices aren't starting
[16:08:39] Existing ones are running fine, but I can't seem to start any
[16:08:40] valhallasw: I've graphite.wmflabs.org set up.
[16:08:44] https://tools.wmflabs.org/wikiinfo/?
[16:08:48] valhallasw: announcement soon.
[16:09:02] Krinkle: might be just out of resources. I'm in the process of adding a new node as we speak.
[16:09:11] Tried to start it a couple times now. cli status says it's running. grid status says it doesn't exist.
[16:09:29] YuviPanda: oh, cool. gerrit reviewer bot was down again for a week or so without me noticing :-(
[16:09:29] Krinkle: it might also be stuck in a restart loop for some reason?
[16:09:36] valhallasw: awgh, damn.
[16:09:45] valhallasw: yeah, we can't really have reliability without monitoring, no
[16:09:46] OK, well I'm stuck in a 300 layer migration nightmare. I can't deal with this as well now.
[16:09:53] last minute toolserver migrations
[16:10:01] Hi! Excuse me, if I want to run my bot on wmf labs (via tools, obviously) as "screen"... how can I do it?
[16:11:26] andrewbogott: Apparently puppet for the beta apaches is pretty messed up at the moment, so it's going to take me a bit to untangle problems from production changes and finish testing. I'll ping you or c.oren later if I get all other problems fixed and still have issues from the mwdeploy shell change.
[16:11:43] andrewbogott: Thanks for your help
[16:12:01] bd808: ok. Ori has been working on Apache puppet code lately, so he might have some quick answers for the puppet failures
[16:12:26] YuviPanda: oh, but I see graphite needs an extra tool to do notifications
[16:12:27] Krinkle: Hurry up... ;)
[16:13:13] valhallasw: yeah, cabotapp or icinga for that
[16:13:31] Yeah. The issues I'm seeing right off the bat are caused by things that needed to be cleaned up in production but are in conflict with hacks that I put in beta :(
[16:13:40] Silke_WMDE1: Easier said than done. https://github.com/Krinkle/ts-krinkle-misc https://github.com/search?q=user%3AKrinkle+tool&type=Repositories&ref=searchresults
[16:13:41] valhallasw: currently working on grafana for prettier graphs before I tackle notifications
[16:13:50] YuviPanda: yeah, but icinga needs root access to add reporters iirc
[16:14:04] There's so much material. I've been migrating bits every other day since November. Mostly blocked by random things.
[16:14:30] Some of the dependencies only became available a few days ago.
[16:14:50] and tools seems to be having difficulties every other week. Nothing structural but just bad luck.
[16:15:11] Every time I sit down to do something, in a few hours I run into an outage or a thing not working...
[16:16:46] Krinkle: Part of it is the sudden inrush of new users and tools; this is why we're expanding labs right now.
[16:16:48] Krinkle: thumbs crossed!
[16:17:31] Silke_WMDE1: shutdown is tomorrow right? not at midnight?
[16:17:51] also accounting for different timezones ideally
[16:17:58] I've a stupid question... How can I run a script even when I'm disconnected from wmflabs-tools? Is there something like "screen" or do I have to use crontab..? :/
[16:18:02] it's tonight at 1:00 UTC
[16:18:40] Gnumarcoo: see https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Submitting.2C_managing_and_scheduling_jobs_on_the_grid
[16:19:12] YuviPanda: uh, thanks a lot!
[16:19:17] Gnumarcoo: yw
[16:21:05] Nemo_bis: thanks for that idea, it didn't come to my mind.
[16:21:44] Silke_WMDE1: I feel two weeks notice time + backups only until the end of August is a bit short on time, to be honest, especially given the summer holidays.
[16:24:21] valhallasw: I see.
[16:25:27] valhallasw: We wanted to limit it because we can't follow that page forever. And we'll still have to decommission the actual hardware at some point.
[16:25:52] Right. How much data is there? Would it be feasible to have it backed up in its entirety somewhere?
[16:25:53] Silke_WMDE1: glad you liked it
[16:25:58] 'become '
[16:26:02] sudo: unable to mkdir /var/lib/sudo/: No space left on device
[16:26:13] tools-login
[16:26:20] /dev/vda2 1.9G 1.9G 0 100% /var
[16:26:21] gurr
[16:26:21] Help?
[16:26:22] valhallasw: You'll have to ask nosy.
[16:26:27] Silke_WMDE1: ok
[16:26:34] Krinkle: looking now
[16:26:49] valhallasw: we can't tar.gz everything from the toolserver on the toolserver.
[16:27:06] (which is somehow clear)
[16:27:11] hmhm
[16:27:32] Krinkle: valhallasw try again?
[16:28:46] $ become krinkle-redirect
[16:28:46] sudo: sorry, a password is required to run sudo
[16:28:48] Still weird
[16:28:51] ?
[16:29:12] I can become 'wikiinfo' now, though its webservice also won't start.
[16:29:34] Krinkle: did you log out and in again?
[16:29:43] 'webservice status' should return a job id, but instead outputs an empty string followed by one of the continuous grid servers not being available
[16:30:04] valhallasw: worked, now I can become krinkle-redirect
[16:30:21] Coren: ^ seems webgrid-02 is dead again?
[16:30:45] YuviPanda: Pedal faster, bring those new nodes in! :-)
[16:30:46] queue instance "continuous@tools-exec-01.eqiad.wmflabs" dropped because it is temporarily not available
[16:30:51] Coren: :)
[16:31:07] But no, it seems up and running.
[16:31:18] They're just full.
[16:31:19] Coren: gridmaster seems to think otherwise...
[16:31:21] ah
[16:31:29] nice time to run into that :)
[16:32:02] I didn't expect there would be that many toolserver users waiting for the very last couple days to all move at once.
[16:32:17] So we got a serious bump in the number of users and tools.
[16:32:40] Coren: Can you tell how many?
[16:32:50] Krinkle: We're out of room on the web nodes; but we're (literally) in the middle of adding more.
[16:32:59] Is there anything I can do to have the webservice start? Since it's near impossible to debug and develop locally (due to labsdb etc.) I basically can't do anything right now
[16:32:59] OK..
[16:33:27] Krinkle: fwiw, you *can* access labsdb locally via an ssh tunnel fairly simply.
[16:33:51] Silke_WMDE: Not precisely, since I can't tell the reason for account creation, etc. But a quick eyeball estimate makes me think there was about a 15% sudden increase in the number of tools over the past 2-3 days.
[16:34:03] hehe
[16:34:18] YuviPanda: Hm.. I tried that but it didn't work
[16:34:22] each slice?
[16:34:56] I can't have them all on the same port
[16:35:07] Or can I somehow forward *.labsdb resolution to there?
[16:35:44] Krinkle: I did that by copy pasting /etc/hosts from tools-login to my local machine, IIRC
[16:36:18] Those local IPs don't resolve locally
[16:36:37] I only need sX.labsdb, but still
[16:36:39] Krinkle: hmm, I'll check up in a while. bringing up the new webgrid node right now, sorry
[16:36:42] k
[16:36:45] but I do distinctly remember having them work
[16:40:15] Tool Labs tools / [other]: Migrate http://toolserver.org/~para/cgi-bin/kmlexport to Tool Labs - https://bugzilla.wikimedia.org/61540#c1 (Thorsten) Seems to be available at http://tools.wmflabs.org/kmlexport/ now.
[16:42:55] Coren: hmm, such low CPU utilization on webgrid-02 http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1404146422.686&target=tools.tools-webgrid-01.&target=tools.tools-webgrid-01.cpu.total.user.&target=tools.tools-webgrid-02.cpu.total.user.value
[16:43:29] Interesting, that's in graphite?
[16:43:34] YuviPanda: Yeah, CPU usage is normally not the primary resource hog; though some web tools have significant peaks.
[16:43:59] Krinkle: yeah, graphite.wmflabs.org has been collecting stats from all labs nodes for the last week.
[16:44:10] ganglia-ish
[16:44:22] Krinkle: labs ganglia has been dead for ages
[16:44:26] I know that
[16:44:38] Though occasionally it's up for a few hours
[16:44:40] Krinkle: and prod ganglia is going away too, I'm told.
[16:45:06] Well, graphite definitely isn't fulfilling as an interface. Data-wise it's great though.
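Editor's sketch for the labsdb tunnel exchange above (16:33 to 16:36): one way around "I can't have them all on the same port" is to give each slice its own local port. This is a minimal illustration, assuming slices are named s1.labsdb, s2.labsdb, and so on, and that they listen on MySQL's default port 3306. The login host name, the local port base, and the helper name `tunnel_command` are hypothetical and were not stated in the channel.

```python
from typing import List

MYSQL_PORT = 3306  # MySQL's default port; assumed to be what the replicas listen on


def tunnel_command(login_host: str, slices: List[int], base_port: int = 4710) -> List[str]:
    """Build an ssh argv that forwards one local port per database slice.

    Slice N (sN.labsdb) is mapped to local port base_port + N, so every
    slice is reachable locally without resolving *.labsdb names.
    """
    argv = ["ssh", "-N"]  # -N: forward ports only, run no remote command
    for n in sorted(slices):
        argv += ["-L", f"{base_port + n}:s{n}.labsdb:{MYSQL_PORT}"]
    argv.append(login_host)
    return argv


if __name__ == "__main__":
    # Hypothetical login host name; substitute your own.
    print(" ".join(tunnel_command("tools-login.wmflabs.org", [1, 2, 3])))
```

With this mapping a local client would connect to 127.0.0.1 on port 4711 for s1, 4712 for s2, and so on, instead of relying on copied /etc/hosts entries.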
[16:45:10] Krinkle: am setting up grafana (grafana.org) soon.
[16:45:14] cool
[16:45:16] Krinkle: for the interface.
[16:45:23] Krinkle: then need to figure something out for monitoring
[16:45:29] icinga?
[16:45:36] anyway, back to work you
[16:45:47] Krinkle: :)
[16:45:53] Krinkle: I'm at this point just waiting for packages to install
[16:46:19] Coren: I'm going to spin up another instance in parallel.
[16:46:32] i don't have permission for webservice
[16:46:33] Yeah, better use of time.
[16:46:35] apparently
[16:48:36] Why isn't Options +Indexes working in apache?
[16:49:12] Dispenser: we have no apache
[16:53:25] Coren: YuviPanda do i have to wait for a new web node?
[16:53:31] Could we finally fucking fix the _AddDefaultCharset utf-8_ for everyone so we aren't constantly serving latin-1
[16:53:48] or whatever it is in nginx
[16:54:00] aude: hey! probably, no new services are being done anyway
[16:54:08] ?
[16:54:10] Dispenser: hey! actually filing a bug would probably help more than random profanity :)
[16:54:10] aude: That shouldn't give you permission errors.
[16:54:39] http://tools.wmflabs.org/audetools/
[16:54:47] i did webservice restart
[16:55:10] Coren: increase quota for tools? can't create new instance
[16:55:37] I'd think it'd be standard after we switched to utf-8 ELEVEN YEARS AGO
[16:55:37] aude: Right, that's not permissions, that's because your service isn't starting because we're out of resources. The next node should be ready soon.
[16:55:43] ok
[16:55:52] i think permission was from yesterday
[16:55:55] not an issue
[16:55:57] the*
[16:56:11] YuviPanda: Will do as soon as I return from grabbing some food; I'm starved.
[16:56:20] Coren: ok :)
[16:56:25] * aude tools are not critical... take your time
[16:57:27] Tool Labs tools / [other]: Migrate http://toolserver.org/~para/cgi-bin/kmlexport to Tool Labs - https://bugzilla.wikimedia.org/61540#c2 (Thorsten) I tried to fix https://commons.wikimedia.org/wiki/Template:GeoGroupTemplate but did not succeed. At dewiki it works like this: https://de.wikipedia.org/w/i...
[16:58:52] aude just name it aude instead of audetools
[17:00:39] YuviPanda: dogeydogey would like to work a bit on labs monitoring. Can you think of a smallish subtask to start him on? He's in #wikimedia-operations I believe.
[17:01:07] andrewbogott: oh, sure.
[17:01:41] thx
[17:01:46] andrewbogott: done
[17:06:50] YuviPanda: I've read that page but... I have to use something like screen because I have to launch replace.py and look at what it does, then when I'm sure that replace.py is working fine, I'd disconnect from core (tool)
[17:07:17] Gnumarcoo: right, but the actual jobs will run on different machines than -login, since -login has limited resources.
[17:07:33] Gnumarcoo: stdout is piped to a .out file in your tool's home directory so you can check that
[17:10:40] YuviPanda: ok, now I'm sure that wmflabs has a defect ;)
[17:10:48] Gnumarcoo: :)
[17:10:59] andrewbogott: can you increase tools cpu/memory quota?
[17:11:22] YuviPanda: yep...
[17:11:37] andrewbogott: ty
[17:13:36] YuviPanda: looks to me like it was just the RAM cap you were hitting?
[17:13:43] andrewbogott: possibly, yeah.
[17:13:45] unless you need 24 more cores
[17:13:54] andrewbogott: hah :D probably not at this moment...
[17:14:31] Gnumarcoo: start a screen on tools-login, then run 'qlogin -l h_vmem=400M' there, and run your bot within that new shell
[17:14:35] Tool Labs tools / [other]: Migrate http://toolserver.org/~para/cgi-bin/kmlexport to Tool Labs - https://bugzilla.wikimedia.org/61540#c3 (Thorsten) And some minutes later it does not work anymore. WMFLabs showing error "No webservice - The URI you have requested, /kmlexport/?project=de&linksfrom=1&artic...
[17:14:39] (adapt h_vmem=... to the amount you think you need)
[17:17:58] valhallasw: ty, I'm rtfm about qlogin cause I don't know how it works, on toolserver it was a little bit simpler
[17:19:21] (I've found this: https://bugzilla.wikimedia.org/show_bug.cgi?id=50248 )
[17:21:59] that's not related to this, though.
[17:26:11] Krinkle: aude try again? tools-webgrid-03 is operational now
[17:27:15] !log tools created tools-webgrid-03 and added it to the queue
[17:27:17] Logged the message, Master
[17:38:49] Tool Labs tools / [other]: Migrate https://toolserver.org/~erwin85/xcontribs.php to Tool Labs - https://bugzilla.wikimedia.org/60881 (PiRSquared17) a:PiRSquared17
[17:40:55] aude: do check if you still have issues
[17:41:15] andrewbogott: thoughts on killing gmond from all labs instances? ganglia is dead, and diamond runs now anyway...
[17:41:30] fine with me
[17:41:33] andrewbogott: alright :)
[17:41:35] um… make sure no one's using it in beta first?
[17:41:52] andrewbogott: you mean betalabs?
[17:41:53] andrewbogott: sure
[17:42:00] yeah
[17:43:27] !log tools created tools-webgrid-04, applying webnode role and running puppet
[17:43:30] Logged the message, Master
[17:44:47] YuviPanda: Yep, the two webservices I need are running now
[17:44:49] thx
[17:46:41] Krinkle: yw. adding one more node to give ourselves more breathing room.
[17:47:59] andrewbogott: also I found out that we already have a logster module. going to start sending labs proxy error codes into graphite soon.
[17:49:16] great!
[17:50:46] Where do I put the nginx config? :-(
[17:54:45] Anything else that should be added there?
[17:54:59] (removed ganglia as YuviPanda said it's going away)
[17:55:34] Wikimedia Labs / tools: screen doesn't work from within 'become' - https://bugzilla.wikimedia.org/50248#c10 (Merlijn van Deen) valhallasw@tools-login:~$ chown valhallasw:tools.tsreports /dev/pts/164 valhallasw@tools-login:~$ chmod g+r /dev/pts/164 valhallasw@tools-login:~$ become tsreports tools.tsrepo...
[17:57:13] Dispenser: have you read https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Web_services
[17:59:21] The response headers say "Server: nginx/1.5.0", so it isn't lighttpd
[17:59:59] Dispenser: sure, that's because we have nginx as a reverse proxy.
[18:00:17] Dispenser: if you're looking for the nginx config itself, it's in the operations/puppet.git repository, under modules/dynamicproxy
[18:08:54] !log deployment-prep Manually added symlink for /etc/apache/wmf on deployment-apache0[12]
[18:08:56] Logged the message, Master
[18:09:36] !log deployment-prep Beta apaches are broken with latest puppet config applied. Working to correct.
[18:09:38] Logged the message, Master
[18:10:33] Wikimedia Labs / wikitech-interface: Enable HSTS (HTTP Strict Transport Security) on Wikitech - https://bugzilla.wikimedia.org/67303#c1 (Andre Klapper) UNCO>RESO/DUP Hi fn84b. Thanks for taking the time to report this! This particular problem has already been reported into our bug tracking system...
[18:48:30] !log deployment-prep Created symlink /apache -> /usr/local/apache on deployment-apache0[12] to fix docroot symlinks
[18:48:32] Logged the message, Master
[18:54:09] Anybody know if there's a problem with WMFlabs right now?
[18:54:19] I tried to use tools.wmflabs.org/croptool but got the following error message:
[18:54:30] Warning: Unknown: write failed: No space left on device (28) in Unknown on line 0 Warning: Unknown: Failed to write session data (files). Please verify that the current setting of session.save_path is correct (/var/lib/php5) in Unknown on line 0
[18:56:17] peteforsyth: there's some growing pains, so /var is probably full on the webserver where croptool runs. Coren / YuviPanda|food? ^
[18:57:22] peteforsyth: Yeah, some tools eat up all the disk space. Lemme check.
[18:57:29] Argh, why is tool labs hijacking 404 handling? If I output HTTP 404 from a php script with a body, that body is ignored and instead it serves a boilerplate of some kind
[18:57:38] Coren / YuviPanda|food /var is full on tools-webgrid-01, -02, at 65% on -3...
[18:57:39] (That incoming rush of new tools has really made things "fun" the last couple days)
[18:57:57] OK thanks valhallasw and Coren. Think I should do this work manually, or hold on for 30 minutes or so?
[18:58:00] Hm.. it's a nice fallback, but if there is actually a cgi handling it, I guess it shouldn't trigger
[18:58:08] -tomcat is also at 200M free / 89% usage
[18:58:11] (not a big deal to do manually if it'll take a while)
[18:58:23] peteforsyth: I'll go clean it up now.
[18:58:34] thanks, awesome :)
[18:58:58] I don't want a custom 404 handler either. I'd like the 404 be handled as it was. Only let the custom 404 handler trigger if the file didn't exist and it wasn't handled by php already.
[18:59:14] Recommendations? I use 404 as status in an API
[19:00:06] Krinkle: ask Coren. I'd happily remove 404 hijacking for now
[19:00:10] petan: why does | /shared/stikkit -i foo -a test -p -t Test | give an error on tools?
[19:00:19] I guess lighttpd has a 404 as well
[19:00:29] so it's being hijacked at both levels I suspect
[19:00:43] The latter is probably abiding php though
[19:00:46] Krinkle: nothing on the lighttpd level as far as I can see
[19:00:50] Krinkle: right but lighty you should be able to override
[19:00:58] Yeah
[19:01:00] Cool
[19:01:05] so :)
[19:01:17] Krinkle: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Default_configuration
[19:01:26] Thx, there already
[19:02:49] YuviPanda|food: diamond piles tons of logs, which doesn't help things; fyi.
[19:03:19] I think this is done by the tools web nginx proxy
[19:03:20] Coren: yeah i'll make the rotation more aggressive
[19:03:34] Coren: but 2G for /var is not really good enough
[19:03:52] I'd say the only thing the web proxy should do is serve the /? info pages and "no web service running" if applicable.
[19:04:07] 404 403 etc, is naturally handled by lighty
[19:04:21] YuviPanda|food: Not for big logs no; but there is a biglog class we could use
[19:04:53] Coren: yeah. I was gonna submit a patch including that as standard. will do so after food
[19:04:55] we could make lighty's default 404/403/500 pages look similar if we want
[19:05:00] Krinkle: I agree.
[19:05:16] Krinkle: there's even a bug for it from liangent, forgot number
[19:05:18] that way it'd only apply if the request is unhandled, not if the backend is serving the page
[19:05:21] (also eating)
[19:05:21] cool
[19:05:24] * Krinkle too
[19:05:29] Shoarma
[19:05:31] :P
[19:05:32] you?
[19:06:15] Krinkle: It's actually fairly important that the default 404, 403 and 500 are explicit about the tool involved, list the maintainers, and present helpful hints. Since I had those in, support requests and bug reports for simple errors dropped tenfold.
[19:06:45] Krinkle: egg fried rice and chilli eggs
[19:06:47] Is that dynamic info accessible from lighttpd?
[19:07:01] e.g. reimplement them on the backend as overridable default instead of on top
[19:07:37] Coren: all better, many thanks!
[19:07:54] or an imnotanidiot flag for the tool in ldap.
[19:08:59] or a special http header to force it that nginx proxy filters out once detected
[19:09:01] anything
[19:10:05] Krinkle: Realistically, I've nothing against the idea but my bandwidth is limited. The chances of getting a +2 quickly to a patch are high, the chances of my making said patch are low in the short term. :-)
[19:11:23] Including a patch that rips out that handling for the short term?
[19:11:25] Krinkle: YuviPanda|food https://bugzilla.wikimedia.org/show_bug.cgi?id=64393
[19:16:15] Krinkle: ripping it out for 404s seems the short term ideal.
[19:16:30] Krinkle: urlproxy.conf in dynamicproxy module
[19:19:36] Krinkle: Much less so. Especially for 404 which, strictly speaking, a running script normally shouldn't return (I'd be more easily convinced for 500 as a script failing and reporting how it fails makes semantic sense)
[19:20:32] Krinkle: But making it conditional to the existence of a header would be a-ok in my book.
[19:20:58] Coren: The pages should be simplified to short 404 Not Found (default lighttpd style) with a pointer to https://tools.wmflabs.org/?tool=
[19:21:08] where any further user names and contact information could go
[19:21:35] headers are suboptimal as it'd be a non-standard thing, hard to debug or scale or even implement when using libraries for RESTful apis
[19:22:09] RESTful apis should not be relying on the body of a non-2XX response in the first place. :-)
[19:22:25] But anyway, aside from the design, the use case you're addressing primarily I guess is file system handling
[19:22:46] For 404, definitely.
[19:22:51] and 403
[19:22:55] if the actual web app is emitting a response, at all, (by rewrite path or whatever) it is intentional and the proxy shouldn't intervene.
[19:23:26] Anyway, I don't mind, as long as stuff works. When stuff doesn't work, I don't mind what the error looks like
[19:23:26] https://gerrit.wikimedia.org/r/143081
[19:23:51] but the app emitting a 404 isn't an error for the proxy (e.g. accessing a redlink on a wiki)
[19:24:20] implementing them as-is on lighttpd would be my preferred solution. I'll see if I can do that a few weeks from now maybe
[19:24:34] that keeps all the useful info, but makes it not apply to web apps
[19:26:41] I'm unconvinced that this is beneficial in most cases; I'd rather allow a way to provide for edge cases explicitly. But lemme think about it a bit while I finish working on that [bleep]ing federation issue.
[19:26:44] Coren: I agree, non 2xx is tricky and slippery, but I don't think it's appropriate for tool-labs, labs or anything in between to enforce or demand certain practices. That's imho too low level. For operations in general, but especially in a low-entry culture where making such demands can easily cascade dozens of layers of abstraction before it can be fixed.
[19:28:49] I think the argument goes the other way 'round actually (where low-entry means that more marginal practice can be made harder to simplify the general case); but I really got to figure that bug I'm working on atm so now is a bad time to think about this in detail.
[19:38:32] I'm confused how server.error-handler-404 += "/error-404.php" still works if nginx rewrites it
[19:40:20] hi. Are there old installations of mediawiki to run tests against?
[19:43:36] Wikimedia Labs / deployment-prep (beta): mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://bugzilla.wikimedia.org/65591#c4 (Bryan Davis) When I revert the local change in the mediawiki::users class that sets the shell for User['mwdeploy'] to /bin/bash and let...
[19:55:45] Coren: Can you fire up the web service for https://tools.wmflabs.org/paste/ ?
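Editor's sketch for the 404 discussion earlier in this stretch (18:57 to 19:28): the two positions can be written down as one small decision function. This only models the policy being debated; it is not the dynamicproxy configuration, and the header name `X-Tool-Error-Passthrough` is hypothetical (the channel mentioned "a special http header" without naming one).

```python
from typing import Mapping, Optional

# Hypothetical opt-in header name, chosen for illustration only.
PASSTHROUGH_HEADER = "X-Tool-Error-Passthrough"
INTERCEPTED = {403, 404, 500}  # the status codes named in the discussion


def proxy_should_replace_body(
    status: int,
    backend_headers: Optional[Mapping[str, str]],
    require_opt_in: bool,
) -> bool:
    """Return True if the proxy should serve its own error page.

    backend_headers is None when no webservice answered at all.
    require_opt_in=False models "never touch a response the app emitted".
    require_opt_in=True models "leave it alone only if the app sent the header".
    """
    if backend_headers is None:
        return True  # nothing is running, so the proxy page is the only answer
    if status not in INTERCEPTED:
        return False  # every other status always passes through
    if not require_opt_in:
        return False
    names = {name.lower() for name in backend_headers}
    return PASSTHROUGH_HEADER.lower() not in names
```

Under the first policy an API that returns 404 with its own body is never overridden; under the second it must send the opt-in header to keep its body.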
[19:56:24] multichill: {{done}}
[19:56:59] You should probably just have a script go over https://tools.wmflabs.org/ and fire up all the tools that are linked but not running
[19:57:25] hi Mpaa - does https://www.mediawiki.org/wiki/Release_notes help?
[19:57:28] has links
[19:58:11] !log tools added tools-webgrid-04 to webgrid queue, had to start portgranter manually
[19:58:14] Logged the message, Master
[19:58:21] Coren: webgrid-04 operational now
[19:58:50] Moar powaaar!
[20:00:32] Coren: :)
[20:01:05] Coren: so, things left to put to graphite would be 1. grid engine stats, 2. proxy error / response time / throughput stats
[20:01:17] Coren: I'll pick up (2) sometime soon
[20:02:41] Coren: might not get time for the next few days, on account of $DAYJOB :)
[20:06:11] Coren: I love your Saturday reply about jenkins isolation :-D
[20:06:27] Coren: I am never going to be able to make a decision now hehe
[20:14:20] hashar: Oh noes! Choices!
[20:17:35] hey scfc_de
[20:20:50] Coren: so tools-webgrid-04 is idle now, but I guess that's ok.
[20:21:21] YuviPanda: You'd expect the next webservice to end up there, in theory.
[20:21:28] Coren: yeah
[20:21:41] Coren: also, any other metrics you think might be useful?
[20:22:25] YuviPanda: I expect your gridengine collector will grab the equivalent to qstat -g c already...
[20:22:43] YuviPanda: Having an idea of the mail queue lengths could be useful to see if there are issues with email
[20:22:48] Coren: yeah, I expect so too. I will read through the code (and qstat manpage) before I do that
[20:22:54] Coren: oh yeah, there's one for that as well
[20:32:34] YuviPanda: Didn't read backlog, but if you're doing monitoring, I submitted some Bugzillas for those (mail queue lengths, etc.).
[20:32:41] scfc_de: oh, links?
[20:32:58] Hm.. webservice not starting again? I'm fiddling with lighttpd a lot, so I'm basically constantly restarting
[20:33:18] YuviPanda: I'll post them later; otherwise just search for reporter = me.
[20:33:25] scfc_de: alright.
[20:33:36] Krinkle: hmm? should be. we have a fully spare machine standing by as well.
[20:33:38] 'webservice restart' is returning suspiciously fast
[20:33:47] It worked 5 minutes ago
[20:33:58] Krinkle: are you sure there's no errors in your lighty local conf?
[20:34:17] Right, that blocks the service itself from starting
[20:34:22] thx
[20:34:33] Krinkle: :)
[20:34:41] found it in error.log
[20:35:13] scfc_de: adding monitoring for queue size now
[20:39:00] scfc_de: found https://bugzilla.wikimedia.org/show_bug.cgi?id=58871
[20:39:09] scfc_de: why should we check it on each host? isn't checking on -mail enough?
[20:40:25] YuviPanda: In case the connection between host X and tools-mail is severed (temporarily) and the mails are stuck on host X.
[20:40:55] scfc_de: ah, hmm. would that be useful? wouldn't the lack of mail on tools-mail alert anyway?
[20:41:01] (once we have alert, in the glorious future)
[20:42:49] Coren: hey, wondering about "gordon"
[20:43:15] would that be just like dickson?
[20:43:39] mutante: Yeah, it was intended to be a second IP allocated to the box that would have just management on it so that we could shut down everything but IRCd on the other
[20:43:50] * YuviPanda wonders if gordon has flash drives
[20:43:54] mutante: It's still a "live" project but it fell by the wayside.
[20:43:58] YuviPanda: *g* nice one
[20:44:06] As more important things popped up.
[20:44:17] Coren: then i just have this comment here https://gerrit.wikimedia.org/r/#/c/115093/3/templates/155.80.208.in-addr.arpa
[20:44:24] saw that DNS change, that was all
[20:44:24] mutante: :)
[20:44:26] thanks, gotcha
[20:45:23] YuviPanda: I wasn't thinking about 1/x of all mails suddenly being stuck, but, for example due to a DNS glitch, a few dozen being stuck on a single host.
[20:45:39] YuviPanda: https://bugzilla.wikimedia.org/show_bug.cgi?id=48668 is grid monitoring.
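Editor's sketch for the mail queue monitoring exchange above (20:35 to 20:45): a minimal way to push a per-host queue length to Graphite using Carbon's plaintext protocol, which takes one `path value timestamp` line per metric. The metric path, the Carbon host, and port 2003 (Carbon's usual plaintext listener port) are assumptions for illustration; how the queue length is measured is left to the caller.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Render one metric in Carbon's plaintext format: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_metric(line: str, host: str, port: int = 2003, timeout: float = 5.0) -> None:
    """Send a preformatted metric line to a Carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path and queue length; nothing is sent here.
    print(format_metric("tools.tools-mail.mailqueue.length", 12, 1404160000), end="")
```

Using one metric path per host, as scfc_de argues for, is what would make a few dozen mails stuck on a single host visible.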
[20:46:07] YuviPanda: https://bugzilla.wikimedia.org/show_bug.cgi?id=48694 is replag graphing in Ganglia (-> Graphite). Should have some alert (> x hours) as well.
[20:46:31] scfc_de: right. am unsure how to do replag
[20:46:41] scfc_de: or anything with the dbs at all, for that matter. they live in prod...
[20:46:47]