[08:41:33] (PS1) Nemo bis: Move LiquidThreads back to main channel [labs/tools/pywikibugs] - https://gerrit.wikimedia.org/r/142991 [08:42:25] (CR) Yuvipanda: [C: 2 V: 2] "LQT has maintainers? ;)" [labs/tools/pywikibugs] - https://gerrit.wikimedia.org/r/142991 (owner: Nemo bis) [08:45:04] (PS1) Nemo bis: Restrict ContentTranslation to #mediawiki-i18n [labs/tools/pywikibugs] - https://gerrit.wikimedia.org/r/142992 [08:53:43] (PS2) Nemo bis: Restrict ContentTranslation to #mediawiki-i18n [labs/tools/pywikibugs] - https://gerrit.wikimedia.org/r/142992 [09:05:40] YuviPanda: now it merges :) [10:43:35] hello, anyone there to answer a puppet question on labs? [10:44:43] nuria@nuria-worker1:~$ puppet agent -tv [10:44:43] I get an error like 'Could not request certificate: getaddrinfo: Name or service not known' when trying to run puppet agent [11:09:44] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap de was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054436 edit summary: /* Zeitplan */ kl. Aktualisierungen [11:13:30] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap en was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054438 edit summary: /* Schedule */ small updates [12:17:09] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap en was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054459 edit summary: /* Schedule */ [12:17:40] Cyberpower678, John F. 
Lewis, and Quentinv57: sulinfo is down (webservice crashed yesterday for many users, unrelated issue, but needs manual start) [12:17:45] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap en was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054460 edit summary: /* Schedule */ [12:19:00] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap en was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054461 edit summary: /* Schedule */ [12:21:00] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap de was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054468 edit summary: /* Zeitplan */ [12:21:27] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap de was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1054470 edit summary: /* Zeitplan */ [13:09:32] Wikimedia Labs: Mail notifications from fab.wmflabs.org delivered only days later (or not at all?) - https://bugzilla.wikimedia.org/65861#c18 (Marc A. Pelletier) REOP>RESO/FIX Same logfile. I've nuked it. [14:15:29] * enhydra slowpoked over all the migration brouhaha -_- [14:15:52] will it be possible to edit redirects after today? [14:16:10] (I mean my .htaccess on Toolserver) [14:19:05] Wikimedia Labs / wikitech-interface: Enable HSTS (HTTP Strict Transport Security) on Wikitech - https://bugzilla.wikimedia.org/67303 (fn84b) UNCO p:Unprio s:normal a:None Wikitech requires HTTPS connections (http://wikitech.wikimedia.org redirects to https://wikitech.wikimedia.org), so coul... [14:57:49] Aaaauugh! [14:58:00] * Coren beats up mysql. [14:59:34] enhydra: just mail tsadmins [15:06:44] enhydra: Yes, valhallasw is right - nosy and amette will create missing redirects from tomorrow as you won't be able to log in anymore. 
[15:07:05] nice, thank you [15:13:53] R.I.P Toolserver and all the tools that did not make the cut :'( [15:16:06] Coren: hey [15:16:49] YuviPanda: Heyo. I'm wrangling tables atm, but I'll be ready to do that node addition in ~30m. Okay with you? [15:17:16] Coren: yeah :) [15:33:55] /var/run/lighttpd/audetools.conf~: Permission denied [15:33:58] /var/run/lighttpd/audetools.conf~: Permission denied [15:34:08] bah /var/run/lighttpd/audetools.conf~: Permission denied [15:34:18] gah, scroll [15:41:06] webservice status Your webservice is scheduled: queue instance "webgrid-lighttpd@tools-webgrid-02.eqiad.wmflabs" dropped because it is temporarily not available [15:41:24] Does it mean that I have to wait for the webservice to start? If yes, is there and estimate? [15:44:23] Guest57626: It means that it couldn't start on tools-webgrid-02 at that moment, though it might start on -01 very quickly. [15:44:38] andrewbogott: Do you think we could work on https://bugzilla.wikimedia.org/show_bug.cgi?id=65591 sometime this week? I think all the hard parts would belong to me (fixing beta scap if the shell change breaks it). [15:44:57] But yeah, I'm about to add a new node to have more room for more services -- the ts going down has given us a big uptake over the past couple days. [15:45:22] Speaking of, YuviPanda, you want to be the one doing it while I guide you? [15:45:46] bd808: I'm happy to help, but don't immediately understand… (I don't know much about how deployment works) [15:46:39] andrewbogott: Mostly I'd like the labs ldap for the mwdeploy user to match up with the prod settings from puppet. [15:47:00] Ah, so just changing the 'shell' field in the ldap entry? [15:47:11] yes. [15:47:49] The possible breakage would be that I used the mwdeploy user to do ssh connections between hosts in beta, so I may have to tweak my ssh automation script [15:49:00] Coren: yea [15:49:17] bd808: do you want me to go ahead and change it, and then you can see what breaks? 
[15:49:46] andrewbogott: Sure. I have time this morning to deal with the possible fallout [15:50:53] so, in ldap, user mwdeploy currently has loginShell: /usr/local/bin/sillyshell [15:51:02] which is different from what the bug says... [15:52:10] Huh. Setting the shell to /bin/bash made puppet happy. I wonder if there are local settings on the beta hosts? [15:52:18] I'll look quickly. [15:53:08] Coren: in the meantime, +2 https://gerrit.wikimedia.org/r/#/c/142208/? :) is already on the general proxy [15:54:18] andrewbogott: There is no local /etc/passwd entry for mwdeploy on deployment-bastion or deployment-apache01. I don't know why puppet would be happy with shell=>/bin/bash if that doesn't match ldap. [15:56:54] bd808: are we talking about prod or beta now? In prod I don't think ldap is much involved. [15:57:23] andrewbogott: I'm talking about beta. [15:59:21] bd808: hm, I'm not totally sure what happens if puppet tries to create a user who already exists in ldap. [16:00:06] andrewbogott: It seems to check the info from ldap and if it matches it's happy to move on. We have done this for several other users and groups. [16:00:41] If the info doesn't match then puppet will error on the user resource [16:01:59] bd808: ok, but, to reconfirm… you want me to switch it to 'false' in ldap, not to bash, right? [16:03:33] andrewbogott: /bin/false please, yes. [16:04:08] That will make it match the define in mediawiki::users [16:04:17] ok! Stay tuned... [16:05:01] bd808: ok, done [16:05:32] Wikimedia Labs / deployment-prep (beta): mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://bugzilla.wikimedia.org/65591#c3 (Andrew Bogott) In ldap, user mwdeploy now has loginShell: /bin/false [16:05:48] * bd808 checks to see if puppet is appropriately sad now [16:08:23] YuviPanda: you were working on monitoring for tool labs at some point, right? [16:08:27] Hm.. 
webservices aren't starting [16:08:39] Existing ones are running fine, btu I can't seem to start any [16:08:40] valhallasw: I've graphite.wmflabs.org setup. [16:08:44] https://tools.wmflabs.org/wikiinfo/? [16:08:48] valhallasw: announcement soon. [16:09:02] Krinkle: might be just out of resources. I'm in the process of adding a new node as we speak. [16:09:11] Tried to start it a couple times now. cli status says it's running. grid status says it doesn't exist. [16:09:29] YuviPanda: oh, cool. gerrit reviewer bot was down again for a week or so without me noticing :-( [16:09:29] Krinkle: it might also be stuck in a restart loop for some reason? [16:09:36] valhallasw: awgh, damn. [16:09:45] valhallasw: yeah, we can't really have reliability without monitoring, no [16:09:46] OK, well I'm stuck in a 300 layer migration night mare. I can't deal with this as well now. [16:09:53] last minute toolserver migrations [16:10:01] Hi! Excuse me, if I want to run my bot on wmf labs (via tools, obviously) as "screen"... how can I do it? [16:11:26] andrewbogott: Apparently puppet for the beta apaches is pretty messed up at the moment, so it's going to take me a bit to untangle problems from production changes and finish testing. I'll ping you or c.oren later if I get all other problems fixed and still have issues from the mwdeploy shell change. [16:11:43] andrewbogott: Thanks for your help [16:12:01] bd808: ok. Ori has been working on Apache puppet code lately, so he might have some quick answers for the puppet failures [16:12:26] YuviPanda: oh, but I see graphite needs an extra tool to do notifications [16:12:27] Krinkle: Hurry up... ;) [16:13:13] valhallasw: yeah, cabotapp or icinga for that [16:13:31] Yeah. The issues I'm seeing right off the bat are caused by things that needed to be cleaned up in production but are in conflict with hacks that I put in beta :( [16:13:40] Silke_WMDE1: Easier said than done. 
https://github.com/Krinkle/ts-krinkle-misc https://github.com/search?q=user%3AKrinkle+tool&type=Repositories&ref=searchresults [16:13:41] valhallasw: currently working on grafana for prettier graphs before I tackle notifications [16:13:50] YuviPanda: yeah, but icinga needs root access to add reporters iirc [16:14:04] There's so much material. I've been migrating bits every other day since November. Mostly blocked by random things. [16:14:30] Some of the dependencies only came available a few days ago. [16:14:50] and tools seems to be having difficulties every other week. Nothing structural but just bad luck. [16:15:11] Every time I sit down to do something, in a few hours I run into an outage or a thing not working... [16:16:46] Krinkle: Part of it is the sudden inrush of new users and tools; this is why we're expanding labs right now. [16:16:48] Krinkle: thumbs crossed! [16:17:31] Silke_WMDE1: shutdown is tomorrow right? not at midnight? [16:17:51] also accounting for different timezones ideally [16:17:58] I've stupid question... How can I run a script also when I'm disconnect to wmflabs-tools? There's something like "screen" or I've to use crontab..? :/ [16:18:02] it's at tonight at 1:00 UTC [16:18:40] Gnumarcoo: see https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Submitting.2C_managing_and_scheduling_jobs_on_the_grid [16:19:12] YuviPanda: uh, thanks a lot! [16:19:17] Gnumarcoo: yw [16:21:05] Nemo_bis: thanks for that idea, it didn't come to my mind. [16:21:44] Silke_WMDE1: I feel two weeks notice time + backups only until the end of august is a bit short on time, to be honest, especially given the summer holidays. [16:24:21] valhallasw: I see. [16:25:27] valhallasw: We wanted to limit it because we can't follow that page forever. And we'll still have to decommission the actual hardware at some point. [16:25:52] Right. How much data is there? Would it be feasable to have it backed up in it's entirety somewhere? 
[16:25:53] Silke_WMDE1: glad you liked it [16:25:58] 'become ' [16:26:02] sudo: unable to mkdir /var/lib/sudo/: No space left on device [16:26:13] tools-login [16:26:20] /dev/vda2 1.9G 1.9G 0 100% /var [16:26:21] gurr [16:26:21] Help? [16:26:22] valhallasw: You'll have to ask nosy. [16:26:27] Silke_WMDE1: ok [16:26:34] Krinkle: looking now [16:26:49] valhallasw: we can't tar.gz everything from the toolserver on the toolserver. [16:27:06] (which is somehow clear) [16:27:11] hmhm [16:27:32] Krinkle: valhallasw try again? [16:28:46] $ become krinkle-redirect [16:28:46] sudo: sorry, a password is required to run sudo [16:28:48] Still weird [16:28:51] ? [16:29:12] I can become 'wikiinfo' now, though its webservice also won't start. [16:29:34] Krinkle: did you log out and in again? [16:29:43] 'webservice status' should return a job id, but instead outputs an empty string followed by one of the continuous grid servers not being available [16:30:04] valhallasw: worked, now I can become krinkle-redirect [16:30:21] Coren: ^ seems webgrid-02 is dead again? [16:30:45] YuviPanda: Pedal faster, bring those new nodes in! :-) [16:30:46] queue instance "continuous@tools-exec-01.eqiad.wmflabs" dropped because it is temporarily not available [16:30:51] Coren: :) [16:31:07] But no, it seems up and running. [16:31:18] They're just full. [16:31:19] Coren: gridmaster seems to think otherwise... [16:31:21] ah [16:31:29] nice time to run into that :) [16:32:02] I didn't expect there would be that many toolserver users waiting for the very last couple days to all move at once. [16:32:17] So we got a serious bump in the number of users and tools. [16:32:40] Coren: Can you tell how many? [16:32:50] Krinkle: We're out of room on the web nodes; but we're (literally) in the middle of adding more. [16:32:59] Is there anything I can do to have the webservice start? Since it's near impossible to debug and develop locally (due to labsdb etc.) I basically can't do anything right now [16:32:59] OK.. 
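The full /var partition reported above (100% of 1.9G on tools-login) is the kind of condition a small periodic check could flag before sudo starts failing. A minimal sketch, assuming Python 3 is available on the host; the 90% threshold and the function names are illustrative only and are not something the log says Tool Labs uses:

```python
import shutil


def percent_used(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold=90.0):
    """True when usage is at or above `threshold` percent (hypothetical cutoff)."""
    return percent_used(path) >= threshold


if __name__ == "__main__":
    # "/var" is the partition from the log; any mounted path works.
    for mount in ("/", "/var"):
        try:
            print(f"{mount}: {percent_used(mount):.0f}% used")
        except FileNotFoundError:
            print(f"{mount}: not present on this machine")
```

The figure may differ slightly from what df prints, since df accounts for reserved blocks differently.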
[16:33:27] Krinkle: fwiw, you *can* access labsdb locally via an ssh tunnel fairly simply. [16:33:51] Silke_WMDE: Not precisely, since I can't tell the reason for account creation, etc. But a quick eyeball estimate makes me think about 15% sudden increase in the number of tools over the past 2-3 days. [16:34:03] hehe [16:34:18] YuviPanda: Hm.. I tried that but it didn't work [16:34:22] each slice? [16:34:56] I can't have them all on the same port [16:35:07] Or can I somehow forward *.labsdb resolution to there? [16:35:44] Krinkle: I did that by copy pasting /etc/hosts from tools-login to my local machine, IIRC [16:36:18] Those local IPs don't resolve locally [16:36:37] I only need sX.labsdb, but still [16:36:39] Krinkle: hmm, I'll check up in a while. bringing up the new webgrid node right now, sorry [16:36:42] k [16:36:45] but I do distinctly remember having them work [16:40:15] Tool Labs tools / [other]: Migrate http://toolserver.org/~para/cgi-bin/kmlexport to Tool Labs - https://bugzilla.wikimedia.org/61540#c1 (Thorsten) Seems to be available at http://tools.wmflabs.org/kmlexport/ now. [16:42:55] Coren: hmm, such low CPU utilization on webgrid-02 http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1404146422.686&target=tools.tools-webgrid-01.&target=tools.tools-webgrid-01.cpu.total.user.&target=tools.tools-webgrid-02.cpu.total.user.value [16:43:29] Interesting, that's in graphite? [16:43:34] YuviPanda: Yeah, CPU usage is normally not the primary resource hog; though some web tools have significant peaks. [16:43:59] Krinkle: yeah, graphite.wmflabs.org has been collecting stats from all labs nodes for the last week. [16:44:10] ganglia-ish [16:44:22] Krinkle: labs ganglia has been dead for ages [16:44:26] I know that [16:44:38] Though occasionally it's up for a few hours [16:44:40] Krinkle: and prod ganglia is going away too, I'm told. [16:45:06] Well, graphite definitely isn't fulfilling for interface. Data-wise it's great though. 
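On the "I can't have them all on the same port" point from the labsdb tunnel exchange: one way around it is to give each slice its own local port in a single ssh invocation. The sketch below only builds the command line and does not run it. The bastion name `tools-login.wmflabs.org`, the local port range starting at 3307, and the slice list are assumptions for illustration; the `sX.labsdb` naming comes from the discussion, and 3306 is MySQL's default port.

```python
def tunnel_command(slices, bastion, first_local_port=3307):
    """Build an ssh argv forwarding one local port per database slice."""
    argv = ["ssh", "-N"]  # -N: forward ports only, run no remote command
    ports = {}
    for offset, name in enumerate(slices):
        local_port = first_local_port + offset
        ports[name] = local_port
        # -L local_port:remote_host:remote_port; the name resolves on the bastion
        argv += ["-L", f"{local_port}:{name}.labsdb:3306"]
    argv.append(bastion)
    return argv, ports


if __name__ == "__main__":
    # Hypothetical bastion host and slice selection.
    argv, ports = tunnel_command(["s1", "s2", "s3"], "tools-login.wmflabs.org")
    print(" ".join(argv))
    print(ports)
```

With such a tunnel up, a local client would connect to 127.0.0.1 on the port mapped to the slice it wants, which sidesteps the /etc/hosts problem because the labsdb names are only resolved on the remote side.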
[16:45:10] Krinkle: am setting up grafana (grafana.org) soon. [16:45:14] cool [16:45:16] Krinkle: for the interface. [16:45:23] Krinkle: then need to figure something out for monitoring [16:45:29] icinga? [16:45:36] anyway, back to work you [16:45:47] Krinkle: :) [16:45:53] Krinkle: I'm at this point just waiting for packages to install [16:46:19] Coren: I'm going to spin up another instance in parallel. [16:46:32] i don't have permission for webservice [16:46:33] Yeah, better use of time. [16:46:35] apparently [16:48:36] Why isn't Options +Indexes working in apache? [16:49:12] Dispenser: we have no apache [16:53:25] Coren: YuviPanda do i have to wait for a new web node? [16:53:31] Could we finally fucking fix the _AddDefaultCharset utf-8_ for everyone so we aren't constantly severing latin-1 [16:53:48] or whatever it is in nginx [16:54:00] aude: hey! probably, no new services are being done anyway [16:54:08] ? [16:54:10] Dispenser: hey! actually filing a bug would probably help more than random profanity :) [16:54:10] aude: That shouldn't give you permission errors. [16:54:39] http://tools.wmflabs.org/audetools/ [16:54:47] i did webservice restart [16:55:10] Coren: increase quota for tools? can't create new instance [16:55:37] I'd think it'd be standard after we switched to utf-8 ELEVEN YEARS AGO [16:55:37] aude: Right, that's not permissions that's because you service isn't starting because we're out of resources. The next node should be ready soon. [16:55:43] ok [16:55:52] i think permission was from yesterday [16:55:55] not an issue [16:55:57] the* [16:56:11] YuviPanda: Will do as soon as I return from grabbing some food; I'm starved. [16:56:20] Coren: ok :) [16:56:25] * aude tools are not critical... 
take your time [16:57:27] Tool Labs tools / [other]: Migrate http://toolserver.org/~para/cgi-bin/kmlexport to Tool Labs - https://bugzilla.wikimedia.org/61540#c2 (Thorsten) I tried to fix https://commons.wikimedia.org/wiki/Template:GeoGroupTemplate but did not succeed. At dewiki it works like this: https://de.wikipedia.org/w/i... [16:58:52] aude just name it aude instead of audetools [17:00:39] YuviPanda: dogeydogey would like to work a bit on labs monitoring. Can you think of a smallish subtask to start him on? He's in #wikimedia-operations I believe. [17:01:07] andrewbogott: oh, sure. [17:01:41] thx [17:01:46] andrewbogott: done [17:06:50] YuviPanda: I've read that page but... I've to use something like screen because I've to launch replace.py and look what it do, then when I'm sure that replace.py is working fine, I'd disconnect from core (tool) [17:07:17] Gnumarcoo: right, but the actual jobs will run on different machines than -login, since -login has limited resources. [17:07:33] Gnumarcoo: stdout is piped to a .out file in your tool's homedirectory so you can check that [17:10:40] YuviPanda: ok, now I'm sure that wmflabs has a defect ;) [17:10:48] Gnumarcoo: :) [17:10:59] andrewbogott: can you increase tools cpu/memory quota? [17:11:22] YuviPanda: yep... [17:11:37] andrewbogott: ty [17:13:36] YuviPanda: looks to me like it was just the RAM cap you were hitting? [17:13:43] andrewbogott: possibly, yeah. [17:13:45] unless you need 24 more cores [17:13:54] andrewbogott: hah :D probably not at this moment... [17:14:31] Gnumarcoo: start a screen on tools-login, then run 'qlogin -l h_vmem=400M' there, and run your bot within that new shell [17:14:35] Tool Labs tools / [other]: Migrate http://toolserver.org/~para/cgi-bin/kmlexport to Tool Labs - https://bugzilla.wikimedia.org/61540#c3 (Thorsten) And some minutes later it does not work anymore. WMFLabs showing error "No webservice - The URI you have requested, /kmlexport/?project=de&linksfrom=1&artic... 
[17:14:39] (adapt h_vmem=... to the amount you think you need) [17:17:58] valhallasw: ty, I'm rtfm about qlogin cause I don't know how it works, on toolserver was a little bit simple [17:19:21] (I've found this: https://bugzilla.wikimedia.org/show_bug.cgi?id=50248 ) [17:21:59] that's not related to this, though. [17:26:11] Krinkle: aude try again? tools-webgrid-03 is operational now [17:27:15] !log tools created tools-webgrid-03 and added it to the queue [17:27:17] Logged the message, Master [17:38:49] Tool Labs tools / [other]: Migrate https://toolserver.org/~erwin85/xcontribs.php to Tool Labs - https://bugzilla.wikimedia.org/60881 (PiRSquared17) a:PiRSquared17 [17:40:55] aude: do check if you still have issues [17:41:15] andrewbogott: thought on killing gmond from all labs instances? ganglia is dead, and diamond runs now anyway... [17:41:30] fine with me [17:41:33] andrewbogott: alright :) [17:41:35] um… make sure no one's using it in beta first? [17:41:52] andrewbogott: you mean betalabs? [17:41:53] andrewbogott: sure [17:42:00] yeah [17:43:27] !log tools created tools-webgrid-04, applying webnode role and running puppet [17:43:30] Logged the message, Master [17:44:47] YuviPanda: Yep, the two webservices I need are running now [17:44:49] thx [17:46:41] Krinkle: yw. adding one more node to give ourselves more breathing room. [17:47:59] andrewbogott: also I found out that we already have a logster module. going to start sending labs proxy error codes into graphite soon. [17:49:16] great! [17:50:46] Where do I put the nginx config? :-( [17:54:45] Anything else that should be added there? 
[17:54:59] (removed ganglia as YuviPanda said it's going away) [17:55:34] Wikimedia Labs / tools: screen doesn't work from within 'become' - https://bugzilla.wikimedia.org/50248#c10 (Merlijn van Deen) valhallasw@tools-login:~$ chown valhallasw:tools.tsreports /dev/pts/164 valhallasw@tools-login:~$ chmod g+r /dev/pts/164 valhallasw@tools-login:~$ become tsreports tools.tsrepo... [17:57:13] Dispenser: have you read https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Web_services [17:59:21] The response headers says "Server: nginx/1.5.0", so it isn't lighttpd [17:59:59] Dispenser: sure, that's because we have nginx as a reverse proxy. [18:00:17] Dispenser: if you're looking for the nginx config itself, it's in the operations/puppet.git repository, under modules/dynamicproxy [18:08:54] !log deployment-prep Manually added symlink for /etc/apache/wmf on deployment-apache0[12] [18:08:56] Logged the message, Master [18:09:36] !log deployment-prep Beta apaches are broken with latest puppet config applied. Working to correct. [18:09:38] Logged the message, Master [18:10:33] Wikimedia Labs / wikitech-interface: Enable HSTS (HTTP Strict Transport Security) on Wikitech - https://bugzilla.wikimedia.org/67303#c1 (Andre Klapper) UNCO>RESO/DUP Hi fn84b. Thanks for taking the time to report this! This particular problem has already been reported into our bug tracking system... [18:48:30] !log deployment-prep Created symlink /apache -> /usr/local/apache on deployment-apache0[12] to fix docroot symlinks [18:48:32] Logged the message, Master [18:54:09] Anybody know if there's a problem with WMFlabs right now? [18:54:19] I tried to use tools.wmflabs.org/croptool but got the following error message: [18:54:30] Warning: Unknown: write failed: No space left on device (28) in Unknown on line 0 Warning: Unknown: Failed to write session data (files). 
Please verify that the current setting of session.save_path is correct (/var/lib/php5) in Unknown on line 0 [18:56:17] peteforsyth: there's some growing pains, so /var is probably full on the webserver where croptool runs. Coren / YuviPanda|food? ^ [18:57:22] peteforsyth: Yeah, some tools eat up all the disk space. Lemme to check. [18:57:29] Argh, why is tool labs hijacking 404 handling? If I output HTTP 404 from a php script with a body, that body is ignored and instead it serves a boiler plate of some kind [18:57:38] Coren / YuviPanda|food /var is full on tools-webgrid-01, -02, at 65% on -3... [18:57:39] (That incoming rush of new tools has really made things "fun" the last couple days) [18:57:57] OK thanks valhallasw and Coren. Think I should do this work manually, or hold on for 30 minutes or so? [18:58:00] Hm.. it's nice a fallback, but if there is actually a cgi handling it, I guess it shouldn't trigger [18:58:08] -tomcat is also at 200M free / 89% usage [18:58:11] (not a big deal to do manually if it'll take a while) [18:58:23] peteforsyth: I'll go clean it up now. [18:58:34] thanks, awesome :) [18:58:58] I don't want a custom 404 handler either. I'd like the 404 be handled as it was. Only let the custom 404 handler trigger if the file didn't exist and it wasn't handled by php already. [18:59:14] Recommendations? I use 404 as status in an API [19:00:06] Krinkle: ask Coren. I'd happily remove 404 hijacking for now [19:00:10] petan: why does | /shared/stikkit -i foo -a test -p -t Test | give an error on tools? 
[19:00:19] I gues ligghttpd has a 404 as well [19:00:29] so it's being hijacked at both levels I suspect [19:00:43] The latter is probably abiding php though [19:00:46] Krinkle: nothing on the lighttpd level as far as I can see [19:00:50] Krinkle: right but lighty you should be able to override [19:00:58] Yeah [19:01:00] Cool [19:01:05] so :) [19:01:17] Krinkle: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Default_configuration [19:01:26] Thx, there already [19:02:49] YuviPanda|food: diamond piles tons of logs, which doesn't help things; fyi. [19:03:19] I think this is done by the tools web nginx proxy [19:03:20] Coren: yeah i'll make the rotation more aggresive [19:03:34] Coren: but 2G for /var is not really good enough [19:03:52] I'd say the only thing the web proxy should do is serve the /? info pages and "no web service running" if applicable. [19:04:07] 404 403 etc, is naturally handled by lighty [19:04:21] YuviPanda|food: Not for big logs no; but there is a biglog class we could use [19:04:53] Coren: yeah. I was gonna sumbit a patch including that as standard. will do so after food [19:04:55] we could make light's default 404/403/500 pages look similar if we want [19:05:00] Krinkle: I agree. [19:05:16] Krinkle: there's even a bug for it from liangent, forgot numer [19:05:18] that way it'd only apply if the request is unhandled, not if the backend is serving the page [19:05:21] (also eating) [19:05:21] cool [19:05:24] * Krinkle too [19:05:29] Shoarma [19:05:31] :P [19:05:32] you? [19:06:15] Krinkle: It's actually fairly important that the default 404, 403 and 500 are explicit about the tool involved, list the maintainers, and present helpful hints. Since I had those in, support requests and bug reports for simple errors dropped tenfold. [19:06:45] Krinkle: egg fried rice and chilli eggs [19:06:47] Is that dynamic info accessible from lighttpd? [19:07:01] e.g. 
reimplement them on the backend as overridable default instead of on top [19:07:37] Coren: all better, many thanks! [19:07:54] or an imnotanidiot flag for the tool in ldap. [19:08:59] or a special http header to force it that nginx proxy filters out once detected [19:09:01] anything [19:10:05] Krinkle: Realistically, I've nothing against the idea but my bandwidth is limited. The chances of getting a +2 quickly to a patch are high, the chances of my making said patch are low in the short term. :-) [19:11:23] Including a patch that rips out that handling for the short term? [19:11:25] Krinkle: YuviPanda|food https://bugzilla.wikimedia.org/show_bug.cgi?id=64393 [19:16:15] Krinkle: ripping it out for 404s seems the short term ideal. [19:16:30] Krinkle: urlproxy.conf in dynamicproxy module [19:19:36] Krinkle: Much less so. Especially for 404 which, strictly speaking, a running script normally shouldn't return (I'd be more easily convinced for 500 as a script failing and reporting how it fails makes semantic sense) [19:20:32] Krinkle: But making it conditional to the existence of a header would be a-ok in my book. [19:20:58] Coren: The pages should be simplified to short 404 Not Found (default lightttpd style) with a pointer to https://tools.wmflabs.org/?tool= [19:21:08] where any further user names and contact information could go [19:21:35] headers are suboptimal as it's be a non-standard thing, hard to debug or scale or even implement when using libraries for RESTful apis [19:22:09] RESTful apis should not be relying on the body of a non-2XX response in the first place. :-) [19:22:25] But anyway, aside from the design, the use case your'e addressing primarily I guess is file system handling [19:22:46] For 404, definitely. [19:22:51] and 403 [19:22:55] if the actual web app is emitting a reponse, at all, (by rewrite path or whatever) it is intentional and the proxy shouldn't intervene. [19:23:26] Anyway, I don't mind, as long as stuff works. 
When stuff doesn't work, I don't mind what the error looks like [19:23:26] https://gerrit.wikimedia.org/r/143081 [19:23:51] but the app emitting a 404 isn't an error for the proxy (e.g. accessing a redlink on a wiki) [19:24:20] implementing them as-is on lighttpd would be my preferred solution. I'll see if I can do that a few weeks from now maybe [19:24:34] that keeps all the useful info, but makes it not apply to web apps [19:26:41] I'm unconvinced that this is beneficial most cases; I'd rather allow a way to provide for edge cases explicitly. But lemme think about it a bit while I finish working on that [bleep]ing federation issue. [19:26:44] Coren: I agree, non 2xx is tricky and slippery, but I don't think it's appropriate for tool-labs, labs or any thing in between to enforce or demand certain practices. That's imho too low level. For operations in general, but especially in a low-entry culture where making such demands can easily cascade dozens of layers of abstraction before it can be fixed. [19:28:49] I think the argument goes the other way 'round actually (where low-entry means that more marginal practice can be made harder to simplify the general case); but I really got to figure that bug I'm working on atm so now is a bad time to think about this in detail. [19:38:32] I'm confused how server.error-handler-404 += "/error-404.php" still works if nginx rewrites it [19:40:20] hi. Are there old installations of mediawiki to run tests against? [19:43:36] Wikimedia Labs / deployment-prep (beta): mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://bugzilla.wikimedia.org/65591#c4 (Bryan Davis) When I revert the local change in the mediawiki::users class that sets the shell for User['mwdeploy'] to /bin/bash and let... [19:55:45] Coren: Can you fire up the web service for https://tools.wmflabs.org/paste/ ? 
[19:56:24] multichill: {{done}} [19:56:59] You should probably just have a script go over https://tools.wmflabs.org/ and fire up all the tools that are linked but not running [19:57:25] hi Mpaa - does https://www.mediawiki.org/wiki/Release_notes help? [19:57:28] has links [19:58:11] !log tools added tools-webgrid-04 to webgrid queue, had to start portgranter manually [19:58:14] Logged the message, Master [19:58:21] Coren: webgrid-04 operational now [19:58:50] Moar powaaar! [20:00:32] Coren: :) [20:01:05] Coren: so, things left to put to graphite would be 1. grid engine stats, 2. proxy error / response time / throughput stats [20:01:17] Coren: I'll pick up (2) sometime soon [20:02:41] Coren: might not get time for the next few days, on account of $DAYJOB :) [20:06:11] Coren: I love you Saturday reply about jenkins isolation :-D [20:06:27] Coren: I am never going to be able to make a decision now hehe [20:14:20] hashar: Oh noes! Choices! [20:17:35] hey scfc_de [20:20:50] Coren: so tools-webgrid-04 is idle now, but I guess that's ok. [20:21:21] YuviPanda: You'd expect the next webservice to end up there, in theory. [20:21:28] Coren: yeah [20:21:41] Coren: also, any other metrics you think might be useful? [20:22:25] YuviPanda: I expect your gridengine collector will grab the equivalent to qstat -g c already... [20:22:43] YuviPanda: Having an idea of the mail queue lengths could be useful to see if there are issues with email [20:22:48] Coren: yeah, I expect so too. I will read through the code (and qstat manpage) before I do that [20:22:54] Coren: oh yeah, there's one for that as well [20:32:34] YuviPanda: Didn't read backlog, but if you're doing monitoring, I submitted some Bugzillas for those (mail queue lengths, etc.). [20:32:41] scfc_de: oh, links? [20:32:58] Hnm.. webservice not starting again? I'm fiddling with lighttpd a lot, so I'm basically constantly restarting [20:33:18] YuviPanda: I'll post them later; otherwise just search for reporter = me. 
[20:33:25] scfc_de: alright. [20:33:36] Krinkle: hmm? should be. we have a fully spare machine standing by as well. [20:33:38] 'websercive restart' is returning suspiciously fast [20:33:47] It worked 5 minutes ago [20:33:58] Krinkle: are you sure there's no errors in your lighty local conf? [20:34:17] Right, that blocks the service tiself from starting [20:34:22] thx [20:34:33] Krinkle: :) [20:34:41] found it in error.log [20:35:13] scfc_de: adding mointoring for queue size now [20:39:00] scfc_de: found https://bugzilla.wikimedia.org/show_bug.cgi?id=58871 [20:39:09] scfc_de: why should we check it on each host? isn't checking on -mail enough? [20:40:25] YuviPanda: In case the connection between host X and tools-mail is severed (temporarily) and the mails are stuck on host X. [20:40:55] scfc_de: ah, hmm. would that be useful? wouldn't the lack of mail on tools-mail alert anyway? [20:41:01] (once we have alert, in the glorious future) [20:42:49] Coren: hey, wondering about "gordon" [20:43:15] would that be just like dickson? [20:43:39] mutante: Yeah, it was intended to be a second IP allocated to the box that would have just management on it so that we could shut down everything but IRCd on the other [20:43:50] * YuviPanda wonders if gordon has flash drives [20:43:54] mutante: It's still a "live" project but it fell by the wayside. [20:43:58] YuviPanda: *g* nice one [20:44:06] As more important things popped up. [20:44:17] Coren: then i just have this comment here https://gerrit.wikimedia.org/r/#/c/115093/3/templates/155.80.208.in-addr.arpa [20:44:24] saw that DNS change , that was all [20:44:24] mutante: :) [20:44:26] thanks, gotcha [20:45:23] YuviPanda: I wasn't thinking about 1/x of all mails suddenly being stuck, but, for example due to a DNS glitch, a few dozen being stuck on a single host. [20:45:39] YuviPanda: https://bugzilla.wikimedia.org/show_bug.cgi?id=48668 is grid monitoring. 
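For the per-host mail queue metric being discussed, Graphite's Carbon daemon accepts a plaintext protocol of one `path value timestamp` line per data point, by default on TCP port 2003. A minimal sketch of formatting and sending such a line follows. The metric path and any host name passed to `send_metric` are hypothetical, how the queue length is measured is deliberately left out, and the collectors mentioned in the log (diamond) may well do this differently.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one data point for Carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host, port=2003, timeout=5.0):
    """Send a preformatted line to a Carbon listener (2003 is the usual default)."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path; the real naming scheme may differ.
    print(carbon_line("tools.tools-mail.mailqueue.length", 12), end="")
```

Reporting one path per host would cover the case raised here, where mail is stuck on a single exec host rather than on tools-mail.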
[20:46:07] YuviPanda: https://bugzilla.wikimedia.org/show_bug.cgi?id=48694 is replag graphing in Ganglia (-> Graphite). Should have some alert (> x hours) as well. [20:46:31] scfc_de: right. am unsure how to do replag [20:46:41] scfc_de: or anything with the dbs at all, for that matter. they live in prod... [20:46:47] well, prod cluster [20:47:10] YuviPanda: Hm.. got a sec to look at a conf? https://gist.github.com/Krinkle/5ffb219be942757657bc The BlankPages entry is redirecting fine (it matches a literal '.' not just any character, and query string is preserved) [20:47:18] but OrphanTalk2.php entry is not fine [20:47:23] !log deployment-prep The state of puppet for beta is badly broken. I have hacked things to get puppet to apply on deployment-apache0[12] but puppet won't apply on deployment-bastion in part due to the same hacks. [20:47:25] Logged the message, Master [20:47:26] it matches any character instead of '.' and drops the query string [20:48:38] YuviPanda: Is Graphite fixed on hosts? If so, I'd define a test for -login or so that just checks that. [20:48:57] scfc_de: what do you mean by 'fixed on hosts'? [20:49:19] YuviPanda: Must every alert/test/etc. have an associated host? [20:49:28] Krinkle: hmm, unsure, sorry. haven't done much with lighty configs. my instinct is to try to remove the two variants and see if that helps [20:50:17] YuviPanda: which variants? [20:50:40] "^/krinkle-redirect/OrphanTalk2/?(\?.*)?" => "https://tools.wmflabs.org/orphantalk/$1", and "^/krinkle-redirect/OrphanTalk2\.php(\?.*)?" => "https://tools.wmflabs.org/orphantalk/$1", [20:50:54] they don't overlap [20:50:59] and they both catch query string [20:51:09] Wikimedia Labs / tools: Some issues: tools-webgrid-03/04, tools-login - https://bugzilla.wikimedia.org/67329 (metatron) UNCO p:Unprio s:normal a:Marc A. Pelletier # tools-webgrid-03 / 04 refuse ssh – permission denied(publickey) # tools-login: The last Puppet run was at Mon Jun 30 10:52:39...
[20:51:21] will try [20:51:29] Krinkle: possibly, yeah. as I said, I haven't really worked with lighty configs at all [20:51:36] k [20:51:59] Oh, I see now [20:52:01] it does overlap [20:52:19] I need to add a $ to one of them [20:52:23] thx [20:52:26] :D [20:53:06] It takes 2 minutes each time for the webservice to come up [20:53:14] that might kill me [20:53:57] Krinkle: ugh, for it to get scheduled and run on a box? [20:54:13] yeah, from running the restart to responding to http from the outside [20:54:27] that's weird. I still see no slots used on webgrid-04 [20:57:59] How can I see which hosts are on which queue (i. e. webgrid)? [20:59:44] (qconf is so compartmentalized that I always find it hard to get the bigger picture.) [21:01:41] "qconf -shgrp \@webgrid" shows the hosts for that queue. Okay, gotta switch to Toolserver to pack my stuff there. Be back after midnight. [21:01:44] Ahh, finally back in the labs-channel [21:02:19] scfc_de: maybe just $ qhost -q [21:02:50] hedonil: Ah! Much better. [21:03:52] YuviPanda: wanna put some load on -04, but need ssh access to have an eye on that [21:03:59] YuviPanda: https://bugzilla.wikimedia.org/show_bug.cgi?id=67329#add_comment [21:04:19] hedonil: I'm unsure how to do that :| [21:04:48] YuviPanda: is the labs-god around? [21:05:01] * YuviPanda pokes Coren for hedonil [21:05:05] hehe [21:06:12] How to do what exactly? [21:06:23] Coren: let others ssh into tools-webgrid-03 and -4 [21:06:30] I thought that would've been part of the puppet role [21:07:33] YuviPanda: It is, but it needs about 4 puppet runs to take because of the no puppetdb issue. (the time it takes for host keys to get propagated) [21:07:40] heh [21:08:14] Oh! And you have to set ssh_hba to 'yes' in the puppet config on wikitech. Forgot that! [21:08:29] Hm.. some web requests are easily taking 2-5 seconds for me. Even though they're all handled from lighttpd conf (no cgi or file system) [21:08:37] Guess there's a big load?
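(Editor's aside on the overlap Krinkle spots at 20:52:01.) A minimal sketch of why the first rule swallows the `.php` URL, assuming lighttpd applies the first matching rule and that Python's `re` behaves like lighttpd's PCRE for syntax this simple. The two patterns are the ones quoted at 20:50:40; the `redirect_target` helper, the `FIXED_RULES` variant, and the sample URLs are hypothetical, for illustration only.

```python
import re

BASE = "https://tools.wmflabs.org/orphantalk/"

# The two rules as quoted in the channel, in config order.
RULES = [
    (r"^/krinkle-redirect/OrphanTalk2/?(\?.*)?", BASE),
    (r"^/krinkle-redirect/OrphanTalk2\.php(\?.*)?", BASE),
]

# Assumed fix: anchor the first rule with "$" so it can no longer
# match just the "OrphanTalk2" prefix of "OrphanTalk2.php".
FIXED_RULES = [
    (r"^/krinkle-redirect/OrphanTalk2/?(\?.*)?$", BASE),
    (r"^/krinkle-redirect/OrphanTalk2\.php(\?.*)?", BASE),
]


def redirect_target(path, rules):
    """Hypothetical helper: first matching rule wins, $1 is appended."""
    for pattern, base in rules:
        match = re.search(pattern, path)
        if match:
            # An optional group that did not participate is None.
            return base + (match.group(1) or "")
    return None


# Unanchored: rule 1 matches the prefix, $1 is empty, query string is lost.
print(redirect_target("/krinkle-redirect/OrphanTalk2.php?x=1", RULES))
# Anchored: rule 1 fails, rule 2 matches and keeps the query string.
print(redirect_target("/krinkle-redirect/OrphanTalk2.php?x=1", FIXED_RULES))
```

Because both optional parts of the first pattern may match nothing, it succeeds on any path that merely starts with `/krinkle-redirect/OrphanTalk2`, which would explain the dropped query string reported at 20:47:26.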
[21:08:48] Krinkle: hmm, shouldn't be, no. [21:09:09] BTW, /var on tools-exec-01 is full. [21:09:16] bah [21:09:49] scfc_de: need to make diamond rotate more aggressively, and also set up biglog [21:10:33] Krinkle: assuming this is webgrid-03, that box also has minimal load [21:11:40] Coren: any objections to me applying biglogs to -exec-01 and moving things around? [21:11:53] Coren: it doesn't have anything logging there that would be problematic [21:12:15] YuviPanda: If you feel confident, sure. [21:12:24] Coren: alright, let me do that and see what breaks :) [21:13:46] YuviPanda: Hm.. any request to https:.../ gets redirected to http: // (trailing slash enforcement) [21:13:54] sumanah, sorry, saw it now [21:14:06] I can't find links to live sites, only to release notes [21:14:13] Krinkle: yeah, there's a bug for that, presently unfixed :( [21:14:35] I suppose there isn't an easy way to make a regex match group in lighttpd default to a trailing slash if none there [21:15:00] Right now I'm using the (\?.*|/.*|$) pattern [21:15:03] "^/krinkle-redirect/OrphanTalk2(\?.*|/.*|$)" => "https://tools.wmflabs.org/orphantalk$1", [21:15:32] so that /foo, foo/, foo/?x=u, foo?x=u and foo/filename get matched [21:15:34] but not foobar [21:16:18] oh well, I'll let this slide for now [21:16:35] Krinkle: yeah, slash handling needs to be fixed at the proxy level [21:17:06] Wikimedia Labs / tools: /etc/cron.d/nonfs is constantly failing on tools-webproxy - https://bugzilla.wikimedia.org/58067#c1 (Tim Landscheidt) NEW>RESO/INV No longer current. [21:18:14] Coren: scfc_de applied biglogs to exec-01 :) [21:18:18] nothing seems to have broken [21:19:11] Except that now you may well have growing logs under /var/log that are filling up /var and can't be truncated. [21:19:22] Time will tell. [21:19:50] Coren: indeed, but the only thing running (other than puppet) that seemed to be actively writing logs was diamond, and I restarted it. need to probably restart sshd too, I guess.
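(Editor's aside on the `(\?.*|/.*|$)` pattern from 21:15:00.) A small sketch for checking the cases Krinkle lists, assuming Python's `re` agrees with lighttpd's PCRE on this pattern. The `captured_suffix` helper and the sample paths are hypothetical; only the regex itself comes from the log.

```python
import re

# Pattern as quoted at 21:15:03; group 1 is what lighttpd would insert as $1.
PATTERN = re.compile(r"^/krinkle-redirect/OrphanTalk2(\?.*|/.*|$)")


def captured_suffix(path):
    """Return the text captured as $1, or None when the rule does not apply."""
    match = PATTERN.search(path)
    return None if match is None else match.group(1)


for path in (
    "/krinkle-redirect/OrphanTalk2",          # the "/foo" case
    "/krinkle-redirect/OrphanTalk2/",         # the "foo/" case
    "/krinkle-redirect/OrphanTalk2/?x=u",     # the "foo/?x=u" case
    "/krinkle-redirect/OrphanTalk2?x=u",      # the "foo?x=u" case
    "/krinkle-redirect/OrphanTalk2/filename", # the "foo/filename" case
    "/krinkle-redirect/OrphanTalk2bar",       # the "foobar" case, must not match
):
    print(path, "->", repr(captured_suffix(path)))
```

The alternation forces whatever follows the tool name to be a query string, a slash-prefixed path, or the end of the URL, which is what keeps the `foobar` case out.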
[21:21:28] Coren: I restarted sshd as well. [21:21:52] That may suffice. It almost certainly will in the medium term. [21:25:14] Coren: yeah :) [21:26:16] Coren: and while I have you, +2 https://gerrit.wikimedia.org/r/#/c/143111/ ? :) [21:26:51] Coren: ty [21:32:21] Wikimedia Labs / Infrastructure: puppet labsstatus not reported when using role::puppet::self - https://bugzilla.wikimedia.org/63296 (Bryan Davis) [21:42:58] scfc_de: hmm, exim's queue size is 17. I wonder if it should be non-zero [21:43:27] Coren: https://gerrit.wikimedia.org/r/#/c/143165/ :) (I manually did that on tools-mail just now) [21:44:38] YuviPanda: For Nagios there are some plugins that measure queue by time, i. e. x messages not older than y minutes are okay, but > z minutes there must be none. That's the logic I would probably adapt. [21:51:21] Tool Labs tools / [other]: Migrate to Tool Labs: https://toolserver.org/~mzmcbride/yanker/ - https://bugzilla.wikimedia.org/61039#c2 (PiRSquared17) NEW>RESO/FIX https://tools.wmflabs.org/pirsquared/ts_archive/mzmcbride/yanker.py [21:56:46] Coren: YuviPanda: Hmm, the 4x puppet-run-period should be over for webgrid-03 right now – still no ssh access [21:57:24] Coren YuviPanda: Or is it stalled like on webgrid-02 ? [21:57:26] The last Puppet run was at Sun Jun 29 16:28:10 UTC 2014 (1764 minutes ago) [21:57:38] wait what [22:01:05] hedonil: dear lord, it's been forever since puppet ran there [22:01:06] eugh [22:01:18] YuviPanda: yeah [22:01:33] !log tools removed stale lockfile for puppet, forcing run [22:01:35] Logged the message, Master [22:02:31] YuviPanda: similar thing on tools-login: The last Puppet run was at Mon Jun 30 10:52:39 UTC 2014 (669 minutes ago) [22:03:06] hedonil: how are you getting these numbers, btw? [22:03:53] YuviPanda: just an open eye while accessing resources...
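(Editor's aside on the age-based queue check scfc_de describes at 21:44:38.) A rough sketch of one possible reading of "x messages not older than y minutes are okay, but > z minutes there must be none". This is not the logic of any particular Nagios plugin; the function name, parameter names, status strings, and the treatment of messages between y and z minutes old are all assumptions.

```python
def queue_status(ages_minutes, max_recent, recent_minutes, stale_minutes):
    """Classify a mail queue from the ages (in minutes) of its messages.

    Assumed rules, with recent_minutes <= stale_minutes:
      - any message older than stale_minutes (z)      -> "critical"
      - more than max_recent (x) messages that are at
        most recent_minutes (y) old, or any message
        between y and z minutes old                   -> "warning"
      - otherwise                                     -> "ok"
    """
    if any(age > stale_minutes for age in ages_minutes):
        return "critical"
    recent = sum(1 for age in ages_minutes if age <= recent_minutes)
    aging = len(ages_minutes) - recent
    if recent > max_recent or aging > 0:
        return "warning"
    return "ok"


# Seventeen fresh messages, as in the exim queue mentioned at 21:42:58,
# with made-up thresholds of x=20, y=10 minutes, z=60 minutes.
print(queue_status([1] * 17, max_recent=20, recent_minutes=10, stale_minutes=60))
```

Under these assumed thresholds a queue of 17 young messages is not alarming by itself; what matters is whether any of them stay queued for long.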
[22:05:11] YuviPanda: but maybe your new comprehensive monitoring will objectify this in the near future ;) [22:05:14] hedonil: what does webgrid-02 report now? [22:05:35] The last Puppet run was at Mon Jun 30 22:01:37 UTC 2014 (3 minutes ago) [22:05:40] hedonil: cool [22:05:48] hedonil: forcing it on -login now [22:06:41] YuviPanda: heh, while we are on it... [22:06:42] !log tools stale lockfile in tools-login as well, removing and forcing puppet run [22:06:44] Logged the message, Master [22:06:56] can you take a look at http://tools.wmflabs.org/enwp10/hello.php [22:07:06] Wikimedia Labs / Infrastructure: puppet labsstatus not reported when using role::puppet::self - https://bugzilla.wikimedia.org/63296#c1 (Andrew Bogott) The check is easy, it's done by a reporter that's on every puppetmaster. Currently, though, that fact is relayed to wikitech via nova instance metadat... [22:07:19] it's 404, but I can see no webservices running [22:07:39] hedonil: well, if there are no services running, it should be 404... [22:08:22] YuviPanda: afaik, it should return something like ...not serviced atm [22:08:31] hedonil: yeah, unsure how you're getting this [22:08:33] hedonil: file a bug? [22:08:50] hedonil: I've been spending far too much time on labs, my $DAYJOB (the wikipedia android app) needs some love too [22:09:26] I think it just needs a restart [22:09:33] hedonil: possibly. [22:10:09] YuviPanda: and without love, things wither ... [22:10:23] hedonil: I ran a webservice start [22:10:32] YuviPanda: cool [22:10:39] !log tools ran webservice start for enwp10 [22:10:42] Logged the message, Master [22:10:57] YuviPanda: The last Puppet run is reported on each login. Too-long should appear in Special:*Instances as "stale", but I don't know what the threshold is there.
[22:11:09] hedonil: https://graphite.wmflabs.org if you haven't seen :) [22:11:42] YuviPanda: ...heard some interesting rumors about it :-) [22:11:54] hedonil: this is completely new, reborn from the old graphite :) [22:12:01] hedonil: actually has interesting data and is up to date :) [22:12:10] hedonil: I shall email las-l tomorrow [22:12:11] sounds really, really great [22:12:11] *labs [22:12:32] hedonil: and is going to get more data soon, like grid engine stats and non-200 responses from the proxy [22:12:37] yeah, monitor *ALL* the things [22:13:03] scfc_de: yeah, I think I could write a simple diamond collector that collects (time since last successful puppet run) :) [22:13:29] YuviPanda: btw, I'll put a note about revived enwp10 tool here: https://www.mediawiki.org/wiki/Tool_Labs/Collection_of_issues_after_Toolserver_shutdown [22:13:36] hedonil: ty :) [22:13:55] 404 means the webserver /is/ running. [22:14:20] Coren: no overrides from the proxy were being caught, which was weird. restarting fixed [22:14:33] Weird indeed. [22:14:36] Coren: also no, diamond can't sudo -u Debian-exim easily. I tried that, fails. Will file a bug upstream tomorrow. [22:14:53] Coren: I also removed the defunct ganglia monitor - it seems to refer to files that don't exist anymore, and was failing anyway [22:16:36] Coren: https://gerrit.wikimedia.org/r/#/c/143167/ rips out the ganglia bit [22:17:16] Coren: ty [22:28:51] Wikimedia Labs / Infrastructure: puppet labsstatus not reported when using role::puppet::self - https://bugzilla.wikimedia.org/63296#c2 (Antoine "hashar" Musso) Can we have a custom reporter to output the status of each node locally in a flat file? We could then fetch and render it somewhere.
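(Editor's aside on the "time since last successful puppet run" idea from 22:13:03.) A minimal sketch of the arithmetic only, not a diamond collector: it parses the login-banner timestamp format quoted at 21:57:26 and 22:05:35 and returns whole minutes elapsed. The function name and format constant are hypothetical, the parser assumes English day and month names, and where a real collector would read the timestamp from is left open.

```python
from datetime import datetime

# Format of the banner line, e.g. "Mon Jun 30 22:01:37 UTC 2014".
# The literal "UTC" is matched as plain text; both values are treated as naive UTC.
BANNER_FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def minutes_since_last_run(timestamp, now):
    """Whole minutes between the banner timestamp and `now` (naive UTC datetime)."""
    last_run = datetime.strptime(timestamp, BANNER_FORMAT)
    return int((now - last_run).total_seconds() // 60)


# The webgrid-02 banner pasted at 22:05:35 said "3 minutes ago".
print(minutes_since_last_run(
    "Mon Jun 30 22:01:37 UTC 2014",
    datetime(2014, 6, 30, 22, 5, 35),
))
```

A value like this could feed an alert threshold, for instance flagging hosts whose last run is more than a chosen number of minutes old; the threshold itself would be a policy choice that the log does not specify.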
[22:31:11] Wikimedia Labs / Infrastructure: puppet labsstatus not reported when using role::puppet::self - https://bugzilla.wikimedia.org/63296#c3 (Andrew Bogott) Yes, be my guest :) The current reporter is in modules/puppetmaster/lib/puppet/reports, it's pretty straightforward. Note, though, that we're in the... [22:41:06] Wikimedia Labs / Infrastructure: puppet labsstatus not reported when using role::puppet::self - https://bugzilla.wikimedia.org/63296 (Greg Grossmeier) [22:41:09] Wikimedia Labs / deployment-prep (beta): Yell loudly of failed puppet runs on Beta Cluster instances - https://bugzilla.wikimedia.org/67333 (Greg Grossmeier) NEW p:Unprio s:normal a:None Sometimes puppet breaks, it happens, but we need to know when it happens in Beta Cluster. Something sim... [22:42:08] Wikimedia Labs / deployment-prep (beta): Yell loudly of failed puppet runs on Beta Cluster instances - https://bugzilla.wikimedia.org/67333 (Greg Grossmeier) p:Unprio>High [22:54:51] Wikimedia Labs / deployment-prep (beta): Yell loudly of failed puppet runs on Beta Cluster instances - https://bugzilla.wikimedia.org/67333#c1 (Yuvi Panda) Sadly we can't really use icinga properly on labs (so I'm told, due to the way resource collection works with puppet). Also the prod icinga stuff i... [23:01:06] Wikimedia Labs / deployment-prep (beta): Yell loudly of failed puppet runs on Beta Cluster instances - https://bugzilla.wikimedia.org/67333#c2 (Bryan Davis) One thing we could do in beta would be to send puppet logs to the beta logstash instance and then announce to #wikimedia-labs and/or #wikimedia-qa... [23:06:23] Krinkle: hey! is zuul only in prod or is it running in labs as well? [23:06:44] Depends on your definition [23:06:50] zuul, the software, is only run in prod. [23:07:07] There might be a test instance, but it'd be a temporary one that antoine uses to try out updates for production [23:07:15] Why?
[23:07:38] Krinkle: right, because I'm adding a diamond collector that'll require python-yaml, and zuul is the only other manifest that defines it [23:08:01] Krinkle: so if I don't add some other way of overcoming this, puppet will fail on instances where the zuul role is applied [23:08:28] yea, don't worry [23:08:50] the integration- instance that uses it (if there is one) probably isn't puppetized anyway [23:08:59] and if it is, I'm not worried about it breaking [23:09:00] Krinkle: but there's a zuul module... [23:09:08] Krinkle: ah, ok then :) I'll just ignore that for now. [23:09:13] hmm, actually, I probably can't [23:09:30] bd808: is role/db used in betalabs? [23:09:30] I mean, afaik we just re-create that instance anyway whenever we go and try an update [23:09:38] so if it still exists, it's unused. [23:10:10] Krinkle: right. I also found them in role/db and statistics, so can't ignore those. [23:10:22] labs statistics? [23:10:33] YuviPanda: Dunno. [23:10:37] Krinkle: no statistics role in labs, but I don't know about role/db [23:10:47] bd808: hmm, ok [23:11:01] YuviPanda: I was unable to parse that previous line. [23:11:05] you found it in what and what? [23:11:23] Krinkle: in manifests/role/db.pp and in manifests/misc/statistics.pp [23:11:35] if either of those are applied anywhere in labs, the puppet run will fail [23:11:46] YuviPanda: We apparently use role::mariadb::beta for the db1 server [23:11:59] YuviPanda: You found diamond there, not zuul, right? [23:12:18] oh, not even diamond [23:12:19] just python-yaml [23:12:20] ok [23:12:29] Krinkle: yeah :) [23:12:35] why would it fail? [23:12:36] no cause for alarm [23:12:45] it's just an ensured package? [23:12:51] bd808: nevermind, narrowed it down.
[23:13:03] Krinkle: yeah, and if you have two package definitions set to ensure the same package puppet will fail [23:13:11] because puppet is stupid [23:13:33] right, package is the resource, not the reference [23:13:37] Krinkle: right [23:13:39] Ugh [23:13:49] anyway, narrowed it down. the db role only is on tin, so it's all right [23:13:53] How do we do that now if two roles both depend on a package? [23:14:13] I'm sure we have lots of packages that depend on php5-cli or something else generic [23:14:47] Krinkle: right, so you've to refactor the packages into something generic and include it [23:14:54] So basically generic re-usable puppet modules (not wmf specific) can't exist because of conflicts like these? [23:15:01] Krinkle: orr.... use ensure_packages everywhere, which is a recent puppet stdlib addition [23:15:08] Krinkle: pretty much. puppet and reusability are kinda fuckall [23:15:23] 2015 - migrate to chef? [23:15:25] Krinkle: but ensure_packages is a solution, but that also means you've to use it everywhere. ensure_packages will still conflict with a package {} [23:15:39] hehe [23:16:22] Wikimedia Labs / deployment-prep (beta): Yell loudly of failed puppet runs on Beta Cluster instances - https://bugzilla.wikimedia.org/67333#c3 (Yuvi Panda) https://gerrit.wikimedia.org/r/#/c/143193/ will log puppetagent metrics into labs graphite, including last run time. [23:16:39] Coren: https://gerrit.wikimedia.org/r/#/c/143193/ I think I put it in the right place for 'all labs nodes' [23:17:44] YuviPanda: Just for future reference, when I want to know if a class is applied in beta I run something like `salt '*' cmd.run 'grep role::db /var/lib/puppet/state/classes.txt'` from deployment-salt. [23:18:06] bd808: ah, didn't know of classes.txt. /var/lib/puppet/state is something I was just exploring, seems quite useful [23:53:39] Coren: hmm, webgrid-04 still has no jobs. [23:53:45] I don't think I botched anything :) [23:53:56] shall check out tomorrow