[00:28:11] My tool's been down for a few days (http://tools.wmflabs.org/traffic-grapher/), think it could be restarted?
[00:37:13] Kevin-Payravi: I bet but I believe that it is up to you to do
[00:58:07] @Negative24 Got it working, thanks. Had to start the webservice.
[00:58:24] Kevin-Payravi: yeah
[01:29:03] Tool-Labs, Wikidata: Lost connection to MariaDB server during query - https://phabricator.wikimedia.org/T76699#1181127 (scfc) Open>declined I understand this task to be "bring back uninterrupted connections", and this seems to be off the table. Increasing the timeout from one minute to two is, as...
[01:35:13] Tool-Labs: Toolserver redirect configuration broken after domain move - https://phabricator.wikimedia.org/T85166#1181147 (scfc) Open>Resolved The two URLs mentioned in this task now redirect properly, and putting the configuration into a Git repository is covered by T85165. I assume the column change...
[01:51:30] Tool-Labs, Pywikibot-compat-to-core, pywikibot-compat, pywikibot-core: patrol.py depends on mwlib.uparser not available on wmflabs - https://phabricator.wikimedia.org/T71980#1181157 (scfc) It's unclear to me what the purpose of this task is. Is it addressed to the #pywikibot-core team to replace...
[01:55:00] Tool-Labs, ToolLabs-Goals-Q4: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1181160 (scfc)
[01:55:02] Tool-Labs, Patch-For-Review: Unify proxylistener interacting code across portgrabber / tool-nodejs / tool-uwsgi - https://phabricator.wikimedia.org/T91957#1181159 (scfc)
[01:57:56] YuviPanda, you around?
[02:06:15] tools-bastion-01 sure is having trouble right now
[02:08:55] YuviPanda, I *may* have figured it out by copying my .git directory
[02:09:00] I will let you know if not
[02:33:46] Tool-Labs, Pywikibot-compat-to-core, pywikibot-compat, pywikibot-core: patrol.py depends on mwlib.uparser not available on wmflabs - https://phabricator.wikimedia.org/T71980#1181193 (Ricordisamoa) I suggest to (close this task as declined and create a new one|repurpose this task) for converting p...
[03:58:05] Tool-Labs: openbadges.org not connecting to tool labs - https://phabricator.wikimedia.org/T94332#1181294 (scfc) Open>declined a:scfc `tools.wmflabs.org`, like Wikipedia & Co., blocks requests without a `User-Agent` header (cf. http://git.wikimedia.org/blob/operations%2Fpuppet.git/production/modules%...
[03:59:42] Tool-Labs: openbadges.org not connecting to tool labs - https://phabricator.wikimedia.org/T94332#1181298 (Lixxx235) Thanks. I suspected that was the issue but wasn't sure.
[04:01:25] Labs, Tool-Labs: Request to be added as maintainer to abandoned bibleversefinder/ tool - https://phabricator.wikimedia.org/T91585#1181299 (scfc) p:Triage>Low
[04:02:09] Tool-Labs, Engineering-Community, WMF-Legal: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#1181302 (scfc)
[04:02:12] Labs, Tool-Labs: Request to be added as maintainer to abandoned bibleversefinder/ tool - https://phabricator.wikimedia.org/T91585#1090568 (scfc)
[04:22:02] Labs, Tool-Labs: Unable to "Create New Tool" from tools.wmflabs.org webpage - https://phabricator.wikimedia.org/T91246#1181306 (scfc) @ananthrk, could you please retry creating a tool to see if that was just a transient failure? ("test-ananthrk" or something similar maybe.)
[04:46:55] Labs, Tool-Labs: Unable to "Create New Tool" from tools.wmflabs.org webpage - https://phabricator.wikimedia.org/T91246#1181327 (scfc) p:Normal>Low
[05:01:29] Tool-Labs: Clean up list of projects on Tool Labs home page and add Tomcat tools - https://phabricator.wikimedia.org/T51937#1181339 (Ricordisamoa) As I wrote at https://lists.wikimedia.org/pipermail/labs-l/2015-March/003462.html, I'd like to take on Hedonil's defunct https://tools.wmflabs.org/directory/ as a...
[05:05:19] Tool-Labs: bigbrother doesn't know how to manage uwsgi-python webservers and other new webservice2 functionality - https://phabricator.wikimedia.org/T94496#1181343 (scfc) a:scfc (There is a typo in your description: It's `uwsgi-python`, not `uswgi-python`.) I think the intended syntaxes are: ``` webserv...
[05:26:21] Tool-Labs: bigbrother doesn't know how to manage uwsgi-python webservers and other new webservice2 functionality - https://phabricator.wikimedia.org/T94496#1181345 (scfc) Actually, it is also intended to support the syntax: ``` webservice -uwsgi-python ``` (But that doesn't work for the same reason.)
[05:29:12] Labs, MediaWiki-extensions-OpenStackManager, Tool-Labs, Tool-Labs-tools-Article-request, and 9 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1181346 (faidon) Ping?
[05:47:33] Labs: /etc/ssh/userkeys/ubuntu notices for every puppet run on labs instances - https://phabricator.wikimedia.org/T94866#1181386 (faidon) I'd rather not use `force => true` in puppet for this, if possible. This can be done massively with salt, no? I still haven't figured out where these files came from, thou...
[05:53:45] Tool-Labs: bigbrother doesn't stop - https://phabricator.wikimedia.org/T94500#1181392 (scfc) p:High>Lowest `bigbrother` is structured in a way that makes it almost impossible to solve this issue and similar ones (cf. T88122) without effectively rewriting it. Conveniently, @yuvipanda currently does tha...
[05:54:53] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1181397 (scfc) p:Triage>Normal
[05:55:14] Tool-Labs: bigbrother only watches users jobs if they already have a job running - https://phabricator.wikimedia.org/T88122#1181398 (scfc) p:Triage>Lowest
[06:41:17] Labs, Tool-Labs, ToolLabs-Goals-Q4: Make sure tools-db is backed up in some form - https://phabricator.wikimedia.org/T88716#1181437 (scfc) What is the purpose of the backups? If it is to guard against hardware failures & Co., I assume some replication to another server that can be pointed to by `tools...
[06:43:39] PROBLEM - Puppet failure on tools-webgrid-03 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[06:47:06] Tool-Labs: Investigate system-level packages - https://phabricator.wikimedia.org/T91877#1181445 (scfc) @Petrb: Were there any problems that necessitated those updates, or can we return to the Precise defaults (for Precise instances)?
[06:50:15] Labs, Tool-Labs: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1181457 (scfc) p:Triage>Lowest
[06:56:40] Tool-Labs, Wikidata: Lost connection to MariaDB server during query - https://phabricator.wikimedia.org/T76699#1181470 (Multichill) declined>Open
[06:57:02] Tool-Labs: log files not written - https://phabricator.wikimedia.org/T85775#1181471 (scfc) Open>Invalid a:scfc It's unclear to me if there is an issue at all, and if yes, if it is caused by the Tools infrastructure. Please reopen if you encounter new problems.
[06:57:04] Tool-Labs, Wikidata: Lost connection to MariaDB server during query - https://phabricator.wikimedia.org/T76699#810101 (Multichill) Just declining (lack of) usability bugs is not the way to go forward.
[06:59:05] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[06:59:08] Tool-Labs: Toolserver redirect configuration broken after domain move - https://phabricator.wikimedia.org/T85166#1181476 (Multichill) Resolved>Open
[06:59:25] Tool-Labs, ToolLabs-Goals-Q4: Put toolserver.org redirect configuration in git - https://phabricator.wikimedia.org/T85165#1181478 (Multichill)
[06:59:27] Tool-Labs: Toolserver redirect configuration broken after domain move - https://phabricator.wikimedia.org/T85166#940431 (Multichill)
[06:59:34] Tool-Labs: Toolserver redirect configuration broken after domain move - https://phabricator.wikimedia.org/T85166#940431 (Multichill) Open>stalled
[06:59:46] Tool-Labs: Toolserver redirect configuration broken after domain move - https://phabricator.wikimedia.org/T85166#940431 (Multichill) This is not resolved, just stalled.
[07:07:57] !log tools.pywikibot puzzled by the "/data/project/pywikibot//data/project/pywikibot/nightly-source/nightly: No such file or directory" emails. The crontab looks ok with "jsub -once -mem 1g /data/project/pywikibot/nightly-source/nightly > /dev/null"
[07:07:59] Logged the message, Master
[07:08:41] RECOVERY - Puppet failure on tools-webgrid-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:11:18] !log chmod +x /data/project/pywikibot/nightly-source/nightly so I can do a manual run
[07:11:18] chmod is not a valid project.
[07:13:19] !log tools.pywikibot chmod +x /data/project/pywikibot/nightly-source/nightly so I can do a manual run. Amir1 changed something to make it fail automagically, but forgot to log it
[07:13:21] Logged the message, Master
[07:13:51] It shouldn't fail
[07:14:00] I fix it, give me one hour
[07:14:03] Tool-Labs, ToolLabs-Goals-Q4: Provide a status page (list) of all active proxy definitions - https://phabricator.wikimedia.org/T88216#1181483 (scfc) p:Triage>High
[07:14:04] multichill: ^
[07:14:37] Amir1: And update https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.pywikibot/SAL with what you've done please
[07:14:54] sure
[07:15:49] Tool-Labs, ToolLabs-Goals-Q4: Make list.php not rely on portgranter - https://phabricator.wikimedia.org/T93197#1181492 (scfc) p:Low>High
[07:17:37] multichill: can you forward me the e-mails?
[07:17:51] You get them yourself
[07:18:06] The contents is only "/data/project/pywikibot//data/project/pywikibot/nightly-source/nightly: No such file or directory"
[07:18:07] I think I made filters in my e-mail
[07:18:28] btw: I added John and Merlijn to the github repo, I want to add you too, Is it okay?
[07:18:31] These end up in my guess-what-labs-is-broken-again-folder
[07:18:39] sure
[07:18:41] :)))
[07:19:10] I did a chmod +x, not sure if that fixes it. I'm not sure jsub is able to run stuff that is -x
[07:20:23] done, you can push your changes to the nightly creator
[07:20:29] https://github.com/Ladsgroup/Pywikibot-nightly-creator
[07:23:57] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:26:21] multichill: hmm, it seems it's not because of my changes
[07:26:29] I changed everything
[07:27:38] maybe it happened due to lack of -N name
[07:27:59] I added it and I hope it doesn't return anything
[07:28:52] Tool-Labs: Add script_path to meta_p.wiki database - https://phabricator.wikimedia.org/T93483#1181497 (scfc) p:Triage>Low Bytes are cheap; is it useful from a MediaWiki perspective to promote: "To reach the API, take `$url` and append `$script_path` and `/api.php`", or does it make more sense to say: "...
[07:30:36] Tool-Labs: Reduce amount of tools-local packages - https://phabricator.wikimedia.org/T91874#1181500 (scfc) p:Triage>Low
[07:31:31] Tool-Labs: Clean out unused security groups on toollabs - https://phabricator.wikimedia.org/T91619#1181502 (scfc) p:Triage>Low
[07:36:28] Tool-Labs, ToolLabs-Goals-Q4: Monitor that proxylistener is accepting new connections - https://phabricator.wikimedia.org/T91958#1181504 (scfc) p:Triage>Low
[07:36:50] Tool-Labs: Rename 'misctools' toollabs package to something more appropriate - https://phabricator.wikimedia.org/T91879#1181506 (scfc) p:Triage>Lowest
[07:38:59] Tool-Labs, ToolLabs-Goals-Q4: Make tools-login / bastion hosts redundant and move them to trusty - https://phabricator.wikimedia.org/T91863#1181508 (scfc) p:Triage>Low
[07:39:08] anyone from labs admins around?
[07:41:41] I created a system to clone [https://github.com/Ladsgroup/Pywikibot-nightly-creator Pywikibot-nightly-creator] from github before every run.
[07:41:55] logged in SAL
[07:46:30] Labs, Tool-Labs, ToolLabs-Goals-Q4: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1181513 (scfc) p:Triage>Low This needs a "plan" with rules that can be (perhaps even in the form of a script) checked: # `tools-master`...
[07:47:32] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1181516 (scfc) p:Triage>Normal
[07:48:28] Labs, Tool-Labs, ToolLabs-Goals-Q4: Set up a schedule for doing failover exercises for toollabs - https://phabricator.wikimedia.org/T91068#1181518 (scfc) p:Triage>Normal
[07:48:53] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1061721 (scfc)
[07:48:54] Labs, Tool-Labs, ToolLabs-Goals-Q4: Set up a schedule for doing failover exercises for toollabs - https://phabricator.wikimedia.org/T91068#1073400 (scfc)
[07:53:26] Tool-Labs: Toolserver tilserver (toolserver.org/tiles/) redirect - https://phabricator.wikimedia.org/T86739#1181527 (scfc) Open>Resolved a:scfc http://toolserver.org/tiles/hikebike/8/205/135.png now redirects to http://a.tiles.wmflabs.org/hikebike/8/205/135.png.
[07:53:36] Tool-Labs: Toolserver tilserver (toolserver.org/tiles/) redirect - https://phabricator.wikimedia.org/T86739#1181530 (scfc) a:scfc>None
[07:53:51] anyone able to restart ia-upload webservice?
[07:54:28] YuviPanda: able to restart ia-upload webservice?
[07:57:00] Tool-Labs: Shorten update interval of lighttpd error logs - https://phabricator.wikimedia.org/T87562#1181532 (scfc) Open>Resolved a:scfc I assume this is resolved by fulfilling "there should be some way to access the tool error log directly".
[07:57:18] Tool-Labs: Shorten update interval of lighttpd error logs - https://phabricator.wikimedia.org/T87562#1181535 (scfc) a:scfc>valhallasw
[08:01:15] Tool-Labs, Programs-Evaluation: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1181554 (Multichill) NEW
[08:02:56] Amir1: You can use !log to update https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.pywikibot/SAL ;-)
[08:04:23] Tool-Labs: Provide regular report on tools with single owner - https://phabricator.wikimedia.org/T86432#1181567 (scfc) @valhallasw: The technical side of this is very easy: Just look for tool groups with only one member :-). But do you mean that there should be a web page listing those tools, a wiki page re...
[08:05:58] Tool-Labs: Support @weekly et al in crontab - https://phabricator.wikimedia.org/T86446#1181570 (scfc) p:Triage>Lowest
[08:08:10] Labs, Tool-Labs: Fix Labs' PAM config mess - https://phabricator.wikimedia.org/T85910#1181576 (scfc)
[08:08:39] Tool-Labs: Document and make it easy for people to request new packages in toollabs - https://phabricator.wikimedia.org/T1101#1181577 (scfc) p:Triage>Lowest
[08:10:52] Tool-Labs, ToolLabs-Goals-Q4: Show replication lags in Graphite - https://phabricator.wikimedia.org/T50694#1181580 (scfc) a:coren>None
[08:11:16] is scfc around here?
[08:12:56] Labs, Tool-Labs, ToolLabs-Goals-Q4: Ensure that all running webservices have a services.manifest file - https://phabricator.wikimedia.org/T95095#1181585 (scfc) p:Triage>Normal
[08:14:01] Tool-Labs: Bigbrother should ignore empty lines in .bigbrotherrc - https://phabricator.wikimedia.org/T94990#1181590 (scfc) p:Triage>Low a:scfc
[08:14:05] paravoid: nope, he has decided to stay off IRC
[08:14:14] paravoid: emailing him usually gets a quick response
[08:14:20] too bad
[08:15:01] paravoid: yeah, but if it’s between him not doing much for tools + being on IRC vs doing stuff + not being on IRC, I’ll take the latter :D And that’s what it seems to have come down to...
[08:17:32] Labs, Tool-Labs: Provide 'Support request' tool labs project - https://phabricator.wikimedia.org/T94359#1181599 (scfc) p:Triage>Normal
[08:18:54] Tool-Labs, ToolLabs-Goals-Q4: Make webservice default to trusty on toollabs - https://phabricator.wikimedia.org/T94788#1181602 (scfc) p:Triage>Low
[08:20:29] Tool-Labs, ToolLabs-Goals-Q4, ToolLabs-Q4-Sprint-1: Explicitly define all the services that Tool Labs provides and their interfaces - https://phabricator.wikimedia.org/T93622#1181614 (scfc) p:Triage>Normal
[08:23:26] (PS1) Ricordisamoa: Status: link to the tool page instead of the list page with anchor [labs/toollabs] - https://gerrit.wikimedia.org/r/202007
[08:26:16] (CR) Yuvipanda: [C: 2] "oh god yes" [labs/toollabs] - https://gerrit.wikimedia.org/r/202007 (owner: Ricordisamoa)
[08:28:46] Tool-Labs, ToolLabs-Goals-Q4: Make webservice default to trusty on toollabs - https://phabricator.wikimedia.org/T94788#1181631 (yuvipanda) p:Low>Normal
[08:37:07] YuviPanda: Check out /home/multichill/bin/weball.sh , I'm prepared for the next crash :P
[08:38:00] https://phabricator.wikimedia.org/T90561 is making good progress and I hope to roll it out next week
[08:38:21] * YuviPanda goes to sleep now
[08:40:15] Oh right, forgot you're on SF times these days
[08:40:22] Good night YuviPanda
[08:40:35] multichill: :)
[08:40:55] multichill: if you did a webservice start or stop over the last week or two, you’d find that it leaves behind a ‘service.manifest’ in your toolhome
[08:41:13] Looks good. I say, Nagios integration and notification is next ;-)
[08:41:16] so hopefully it’ll be all much better, and bigbrother can be put out to pasture in a couple of weeks
[08:41:19] well
[08:41:27] Proper parent slaving so you only get notified when only your tool dies :P
[08:41:34] downtimes, you mean :P
[08:41:39] no nagios, I think
[08:41:49] there’s comments on https://phabricator.wikimedia.org/T90561
[08:41:55] mostly between me and scfc
[08:42:44] whatever I end up calling it will email you if it finds out your tool is down
[08:42:59] but I also plan on allowing people to define more monitoring endpoints to alert them about in their service manifests
[08:43:07] dependencies are gonna be hard tho
[08:43:12] That would be nice
[08:43:18] but that’s not going to happen for at least another month or so
[08:43:23] plenty of other stuff for us to do
[08:43:30] https://phabricator.wikimedia.org/project/sprint/board/1139/query/open/
[08:43:33] and that
[08:43:36] ’s not even complete
[08:43:43] Better busy than bored
[08:44:01] heh
[08:44:10] NFS still remains a huge SPOF
[08:44:16] and I’ve no solution for that as such, tho
[08:45:55] YuviPanda: I think my grid ex-coworkers switched all their stuff to https://en.wikipedia.org/wiki/Ceph_%28software%29
[08:46:20] we tried ceph a few years ago for thumbnail storage, I think...
[08:46:22] didn’t work out great
[08:46:29] That's performance storage
[08:46:31] but we will probably give it a go again at some point in the near future
[08:46:40] Interesting.
[08:46:41] once we get rid of wikitech
[08:46:56] It aint fast, but it aint slow either and it is reliable
[08:47:14] And you can just add and remove hardware
[08:47:19] Why did it not work out ?
[08:47:26] Hi YuviPanda
[08:47:28] I don’t know, I wasn’t around at that time :)
[08:47:34] andrewbogott might know more when he’s here
[08:47:38] and multichill :)
[08:47:40] Ok.
[08:48:01] I am just back after a good 3 day break :)
[08:48:11] :)
[08:48:17] https://ganglia.surfsara.nl/?c=Grid%20Service%20Cluster&m=load_one&r=hour&s=by%20name&hc=4&mc=2 <- hmm, the only host with ceph in its name seems to be down
[08:48:20] I have to go to sleep soon, I’ve to go to an ‘office’ these days
[08:48:40] YuviPanda: We went from church to a resort in Mahabalipurm so had a nice weekend.
[08:48:43] YuviPanda: ok.
[08:48:47] nice
[08:49:12] YuviPanda: Save https://www.os3.nl/_media/2012-2013/courses/rp1/p04_presentation.pdf for later
[08:49:14] YuviPanda: take rest.bye.
[08:49:22] That's their initial try presentation
[08:50:06] Have to run to a meeting.
[08:50:15] And the report is at https://staff.science.uva.nl/c.t.a.m.delaat/rp/2012-2013/p04/report.pdf
[08:51:14] Monday is a national holiday here :-)
[08:51:31] Tool-Labs, Wikidata: Lost connection to MariaDB server during query - https://phabricator.wikimedia.org/T76699#1181639 (scfc) Oh, if I had only known that me summarizing the comments in this task and thinking about the different implications of different approaches could be shortened to "just declining",...
[08:52:03] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[09:14:50] Tool-Labs, Programs-Evaluation: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1181673 (Ricordisamoa) @Multichill by "brother" you meant "broader", didn't you?
[09:17:06] RECOVERY - Puppet failure on tools-webproxy-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:21:48] Labs, Tool-Labs: WIWOSM not working in Wikipedias - https://phabricator.wikimedia.org/T87038#1181707 (scfc) Open>Resolved a:scfc # When I go to https://ca.wikipedia.org/wiki/G%C3%B3sol and click "(mapa)", a working OSM map appears. # When I go to https://ca.wikipedia.org/wiki/Categoria:Geografia_...
[09:21:56] Labs, Tool-Labs: WIWOSM not working in Wikipedias - https://phabricator.wikimedia.org/T87038#1181711 (scfc) a:scfc>None
[09:32:01] Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1181746 (scfc) ``` From: root@tools.wmflabs.org (Cron Daemon) Subject: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) To: root@tools.wmflabs.org Date:...
[09:32:34] Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1181747 (scfc) ``` From: root@tools.wmflabs.org (Cron Daemon) Subject: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) To: root@tools.wmflabs.org Date:...
[09:50:49] Tool-Labs: Toolserver redirect configuration broken after domain move - https://phabricator.wikimedia.org/T85166#1181757 (scfc) If you are keen on not saying much, you don't have to say anything. The information that you think "this is not resolved, just stalled" is already conveyed by Phabricator: {F109156}
[10:11:54] Tool-Labs, Programs-Evaluation: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1181767 (Multichill)
[10:31:06] Tool-Labs: Document connecting to labsdb from outside of labs - https://phabricator.wikimedia.org/T85294#1181803 (scfc) Open>Resolved a:scfc [[https://wikitech.wikimedia.org/w/index.php?title=Help:Tool_Labs/Database&diff=152321&oldid=145713|Done]].
[10:33:24] Tool-Labs, Pywikibot-compat-to-core, pywikibot-core: Install all pywikibot python dependencies on tool labs - https://phabricator.wikimedia.org/T86015#1181815 (scfc) p:Triage>Low
[10:35:15] Tool-Labs: qstat doesn't always work - https://phabricator.wikimedia.org/T85774#1181822 (scfc) Could you please note the job numbers of the started jobs so that we can debug this?
[10:35:25] Tool-Labs: qstat doesn't always work - https://phabricator.wikimedia.org/T85774#1181824 (scfc) p:Triage>Normal
[10:37:22] Tool-Labs, Wikimedia-Labs-Infrastructure: Make (redacted) log_search table available on ToolLabs - https://phabricator.wikimedia.org/T85756#1181832 (scfc) p:Triage>Normal
[10:39:36] Tool-Labs: Return human-readable 404 for non-existing projects - https://phabricator.wikimedia.org/T85738#1181843 (scfc) p:Triage>Low
[10:49:20] According to Help:SSH Fingerprints, bastion has fingerprint 9d:48:7e:d8:89:49:0f:2d:39:6d:af:5e:23:02:aa:f7, but when I connect through SSH to bastion.wmflabs.org, ssh says that it is ea:b9:9f:e7:22:f9:94:18:8c:98:1d:69:c9:40:a1:a7. Is the help page out of date?
[10:51:04] (bastion.wmflabs.org resolves to 208.80.155.129 for me - I hope this is at least correct?)
[10:58:41] Tool-Labs: Return human-readable 404 for non-existing projects - https://phabricator.wikimedia.org/T85738#1181886 (scfc) Oh, yes. Related, but not necessarily intended for humans, are the 403s depending on the `User-Agent` header.
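(Editor's note on the `User-Agent` 403s mentioned in T94332 and T85738: a minimal sketch, assuming a Python client, of attaching an explicit `User-Agent` header to a request. The tool name, contact address and helper name `build_request` are hypothetical and not taken from the log; the request is only constructed here, not sent.)

```python
import urllib.request

# Hypothetical tool name and contact address, for illustration only.
USER_AGENT = "example-tool/0.1 (maintainer@example.org)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://tools.wmflabs.org/")
# urllib stores header names capitalized, hence "User-agent" in the lookup.
print(req.get_header("User-agent"))
```

(Passing `req` to `urllib.request.urlopen` would send the header; whether a given value is accepted is up to the server's own rules.)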
[11:02:36] Tool-Labs, Puppet: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733#1181911 (scfc) p:Triage>Normal
[11:03:07] Tool-Labs, Wikimedia-Labs-General: labswiki isn't replicated on Labs - https://phabricator.wikimedia.org/T89548#1181914 (scfc) p:Triage>Normal
[11:03:49] Labs, Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1181917 (scfc) p:Triage>Normal
[11:04:09] Tool-Labs: Webservice start failing with duplicate port allocation from portgranter - https://phabricator.wikimedia.org/T93875#1181925 (scfc) p:Triage>Normal
[11:04:27] Tool-Labs: Memory Exhausted Near / Tool labs error while querying with Python - https://phabricator.wikimedia.org/T93074#1181926 (scfc) p:Triage>Low
[11:04:38] Tool-Labs, ToolLabs-Goals-Q4: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1181928 (scfc) p:Triage>Normal
[11:04:54] Tool-Labs, Wikimedia-Hackathon-2015, Documentation: Re-organize Tool Labs documentation - https://phabricator.wikimedia.org/T91509#1181929 (scfc) p:Triage>Low
[11:06:51] Labs, Tool-Labs, Analytics: Make anonymized clickstream data available to the public - https://phabricator.wikimedia.org/T91495#1181937 (scfc) p:Triage>Normal
[11:07:13] Labs, Tool-Labs, ToolLabs-Goals-Q4: Have bigbrother run on multiple nodes to provide redundancy against tools-submit failure - https://phabricator.wikimedia.org/T91237#1181939 (scfc) p:Triage>Lowest
[11:07:40] Labs, Tool-Labs: Investigate OOMs in trusty webgrid nodes - https://phabricator.wikimedia.org/T91194#1181940 (scfc) p:Triage>Normal
[11:08:03] Tool-Labs: Document how to turn shadow into master - https://phabricator.wikimedia.org/T91133#1181945 (scfc) p:Triage>Normal
[11:08:43] Labs, Tool-Labs, Patch-For-Review, ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1181946 (scfc) p:Triage>Low
[11:08:44] No-one? The SSH fingerprints for bastion*.wmflabs.org all seem to not match what's in Help:SSH Fingerprints on the wikitech wiki.
[11:11:52] Tool-Labs, Wikimedia-Hackathon-2015: Create a set of 'template' tools in various languages with deploy scripts for toollabs - https://phabricator.wikimedia.org/T91059#1181956 (scfc) p:Triage>Normal This is also important for testing that we don't introduce regressions in our setup. I have no clue...
[11:12:01] Tool-Labs: qstat doesn't always work - https://phabricator.wikimedia.org/T85774#1181958 (dnaber) We're not running on Tool Labs anymore, so I cannot help in reproducing this. Feel free to close.
[11:12:49] Tool-Labs, Wikimedia-Hackathon-2015: Conduct a Tool Labs workshop at Lyon Hackathon - https://phabricator.wikimedia.org/T91058#1181963 (scfc) p:Triage>Normal
[11:13:53] Labs, Tool-Labs, ToolLabs-Goals-Q4: Monitor bigbrother - https://phabricator.wikimedia.org/T90850#1181967 (scfc) p:Triage>Lowest
[11:14:22] Labs, Tool-Labs, Tracking: Make dumps syncing to Labs NFS reliable enough (Tracking) - https://phabricator.wikimedia.org/T90848#1181968 (scfc) p:Triage>Normal
[11:15:17] why did no one tell me there was a hackathon? :(
[11:19:46] Labs, Tool-Labs: Set up sufficient monitoring for toollabs - https://phabricator.wikimedia.org/T90845#1181990 (scfc) p:Triage>Normal
[11:23:08] Tool-Labs: Java jobs stop working - https://phabricator.wikimedia.org/T88799#1181996 (scfc) AFAIUI, `-Xmx250M` means that the Java //application// gets 250 MByte "memory"; the Java VM will add its own footprint to that. A test program: ``` public class HelloWorld { public static void main (String...
[11:24:43] Labs, Tool-Labs, Tool-Labs-tools-Erwin's-tools, Monitoring: monitor webservice / 504 errors for erwin - https://phabricator.wikimedia.org/T90800#1181999 (scfc) p:Triage>Normal a:coren
[11:25:49] Tool-Labs: Add 'file a bug' link to tool labs error pages - https://phabricator.wikimedia.org/T90570#1182004 (scfc) p:Triage>Low
[11:26:12] Labs, Tool-Labs, Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1182006 (scfc) p:Triage>Normal
[11:30:46] Tool-Labs: Common http error response pages - https://phabricator.wikimedia.org/T89864#1182010 (scfc) p:Triage>Normal IIRC in the past the proxy overwrote individual tool error pages to ensure that the error pages clearly stated who was to blame for an error and not all reports got routed to `webmaster...
[11:31:17] Tool-Labs, Wikimedia-Labs-Infrastructure: Make ar_content_format and ar_content_model available on ToolLabs - https://phabricator.wikimedia.org/T89741#1182014 (scfc) p:Triage>Normal
[11:32:03] Tool-Labs, Monitoring: Track and alert based on gridengine error states - https://phabricator.wikimedia.org/T88237#1182017 (scfc) p:Triage>Normal
[11:39:54] Tool-Labs: qstat doesn't always work - https://phabricator.wikimedia.org/T85774#1182021 (scfc) Open>Invalid a:scfc
[11:41:32] Tool-Labs: Java jobs stop working - https://phabricator.wikimedia.org/T88799#1182024 (scfc) Open>Invalid a:scfc >>! In T85774#1181958, @dnaber wrote: > We're not running on Tool Labs anymore, so I cannot help in reproducing this. Feel free to close.
[11:46:45] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/GoldenRing was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=152324 edit summary:
[12:32:51] Amir1: It's still broken
[12:33:07] :(
[12:33:23] I check
[12:33:33] the crontab seems okay
[14:31:36] Labs, hardware-requests, operations: eqiad: (6) labs virt nodes - https://phabricator.wikimedia.org/T89752#1182146 (Cmjohnson) Andrew, What naming convention do you want to use? Stick with labs10xx for now or start with something new? Also, do you want these in row D or are there any cisco's we ca...
[15:24:30] Labs: Renaming scheme for labs servers - https://phabricator.wikimedia.org/T95042#1182253 (Andrew) Rob -- I don't mess with the labsdb boxes much. I'd be happy for them to be renamed but it may not be worth the trouble. And, yes, as per... the entire text of this ticket, virt* will become lab*
[15:26:12] Labs, hardware-requests, operations: eqiad: (6) labs virt nodes - https://phabricator.wikimedia.org/T89752#1182256 (Andrew) According to the naming scheme in https://phabricator.wikimedia.org/T95042, let's name these boxes 'labvirt10xx'. You can start with 1001 and when I re-image the other HPs I'll r...
[16:40:20] Labs, Tool-Labs, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Ensure that all running webservices have a services.manifest file - https://phabricator.wikimedia.org/T95095#1182557 (yuvipanda)
[16:40:41] hey Coren_away! are you working today?
[16:40:52] or is easter off for canada too?
[16:51:46] Labs: Renaming scheme for labs servers - https://phabricator.wikimedia.org/T95042#1182577 (yuvipanda) renaming labsdb should be done very carefully, since user code still relies on it being called labsdb*, and there’s also a few hundred unpuppetized /etc/hosts files running around. we could ‘fix’ it by having...
[16:52:18] Labs: Support bare-metal server allocation in labs - https://phabricator.wikimedia.org/T95185#1182578 (Andrew) NEW
[17:24:23] Labs, Tool-Labs, Analytics: Make anonymized clickstream data available to the public - https://phabricator.wikimedia.org/T91495#1182724 (DarTar) We released [[ http://figshare.com/articles/Wikipedia_Clickstream/1305770 | static datasets ]] and experimented with a lightweight API on Labs, designed by @...
[17:25:22] Labs, Tool-Labs, Analytics: Make anonymized clickstream data available to the public - https://phabricator.wikimedia.org/T91495#1182729 (DarTar)
[18:23:56] (CR) Southparkfan: [C: 1] feed #wmt to new channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/199329 (owner: John F. Lewis)
[18:35:00] (PS2) Awight: Correct Fundraising project tag regex [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/199665
[18:35:15] (CR) Awight: "Weekly ping :p" [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/199665 (owner: Awight)
[18:35:33] (CR) Legoktm: [C: 2] Correct Fundraising project tag regex [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/199665 (owner: Awight)
[19:47:08] Labs, Tool-Labs, ToolLabs-Goals-Q4: Review and productionize service manifest monitor - https://phabricator.wikimedia.org/T95210#1183223 (yuvipanda) NEW
[19:51:27] Labs, Tool-Labs, ToolLabs-Goals-Q4: Review and productionize service manifest monitor - https://phabricator.wikimedia.org/T95210#1183241 (jeremyb-phone)
[19:52:25] Labs, Tool-Labs, ToolLabs-Goals-Q4: Review and productionize service manifest monitor - https://phabricator.wikimedia.org/T95210#1183244 (Legoktm) collector.py: * toollog: just append to the file ('a') instead of 'w'riting to it * collect: use yaml.safe_load manifest.py: * missing space between =( in...
[19:52:31] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Review and productionize service manifest monitor - https://phabricator.wikimedia.org/T95210#1183245 (10yuvipanda) [20:28:34] YuviPanda: can haz approval? https://www.mediawiki.org/w/index.php?title=Special:OAuthListConsumers/view/2df1a77fd9d4fefa8211ec875ced46b1&name=&publisher=Ragesoss&stage=-1 [20:29:57] ragesoss: why do you need edit / move, btw? [20:31:23] YuviPanda: in this version, one of the things you can do (if you the instructor of a class) is leave messages on your students' talk pages. [20:32:08] ragesoss: ah cool. approved. [20:32:13] thanks much! [20:32:33] yw [20:33:32] how can I test a puppet patch that adds a new trebuchet repo? [20:33:40] andrewbogott_afk: whenever you're back around could you take a look at the staging project? Seems like some recent changes to the value of `facter -p domain` have changed the puppet master configuration which is confusing our instances. [20:34:13] I tried making a machine into a self-hosted puppetmaster + self-hosted saltmaster, but puppet just hangs after that [20:40:24] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Review and productionize service manifest monitor - https://phabricator.wikimedia.org/T95210#1183630 (10yuvipanda) I have imported it from github to gerrit, is at operations/software/tools-manifest now. I'll move it to python 3.4 shortly as well. [20:49:32] tgr: where is the new self-hosted instance? [20:51:01] andrewbogott: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000a5e.eqiad.wmflabs [20:51:37] puppet runs fine as long as the sentry role is disabled [20:52:01] when it's enabled and tries to force a trebuchet install, puppet hangs [20:52:37] ok… so, presumably something is wrong with the sentry puppet code? 
[20:53:02] Sorry, your earlier comment made me think that there was a general problem, now it sounds like you just need to debug :) [20:53:34] thcipriani: can you give me a specific instance to look at? Is there a local puppet master for that project or are they just using the normal labs master? [20:53:53] andrewbogott: it’s using a local puppetmaster, staging-palladium [20:54:02] andrewbogott: so if you login to the puppet master for the staging project, staging-palladium [20:54:30] thcipriani: I’ve clicked on a bunch of instances in that project, so far none of them have a custom master defined [20:54:35] But, ok, I’ll look at staging-palladium [20:54:37] in /root/ if you do: head -n 300 /var/log/puppet.log.3 | less -FirSX [20:54:59] well, it isn’t a puppetmaster either [20:55:12] exactly, because the fqdn changed. [20:55:46] so puppet::master::self::master defined here: https://wikitech.wikimedia.org/wiki/Hiera:Staging [20:55:50] is no longer correct [20:56:19] oh, it’s set /there/ [20:56:35] how do I find out who created/which wikitech/nova project owns a specific something.wmflabs.org domain? [20:58:47] greg-g: rephrase? [20:58:52] too many slashes :p [20:59:03] andrewbogott: so if you do look at that puppet log file I ungziped in /root/ you'll see a bunch of config changed recently, not sure how to back out of that. [20:59:25] greg-g: what is the domain is usually a good place to start unless this is a question for the future? [21:00:09] JohnLewis: task.wmflabs.org [21:00:17] who owns/created that? :) [21:02:36] thcipriani: it looks to me like someone switched use_dnsmasq to ‘false’ and then back to ‘true’ again — is that right? [21:03:41] andrewbogott: I didn't do that, lemme ping over in -releng, see if anyone tried that. For staging-palladium you're asking? 
[21:03:48] yep [21:04:17] andrewbogott: also greg-g wants to know which project owns task.wmflabs.orgs when you have a spare second [21:08:28] andrewbogott: It doesn't look like anyone changed it back and forth. [21:08:54] greg-g: I might be able to find out after just host'ing the domain [21:09:31] actually, nevermind, can't see assigned floating IPs :( [21:09:35] are they using domainproxy? [21:10:03] andrewbogott: is that the only way existing servers opt-in the domain change? [21:10:17] thcipriani: should be, yes [21:10:32] mutante: it seems its literally just assigned to an IP from within wikitech [21:11:00] huh, just confirmed with everyone who has access to the staging project save, "novaadmin" whatever that account is [21:11:16] novaadmin is like root [21:11:31] well, the wiki user 0 [21:11:38] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Review and productionize service manifest monitor - https://phabricator.wikimedia.org/T95210#1183776 (10scfc) Not complete: - `collector.py`: -- `toollog()`: --- Files created in the tool's directory by `root` should be deletable and even modifyab... [21:12:28] thcipriani: even though greg-g emailed asking if he should do it, earlier today? [21:12:51] not if *I* should do it, if we should do it, I was just making sure everyone saw it [21:13:58] andrewbogott: I doublechecked with everyone listed the project list page. [21:14:52] plus, the log file where the config changed is from the 5th: for whatever reason we'd disabled puppet runs on palladium, this is just what happened after re-enabling. [21:20:20] thcipriani: I’ve fixed puppet on that instance. I don’t know why the hiera manifest isn’t getting picked up. [21:20:30] I’ve never seen hiera work, personally, so that’s something to bug YuviPanda about. [21:21:24] Oh, well, /now/ it is seeing hiera, even though it didn’t in the last few runs. [21:21:25] *shrug* [21:23:23] stupid non-deterministic computers [21:23:59] andrewbogott: awesome, thanks for the fix. 
[21:24:47] I just noticed that staging-mx.eqiad.wmflabs did the same thing: sudo facter -p domain returns staging.eqiad.wmflabs [21:25:16] ok, I’ll look [21:26:30] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Review and productionize service manifest monitor - https://phabricator.wikimedia.org/T95210#1183883 (10yuvipanda) Avoiding symlink race conditions seems to be fun :) >>! In T95210#1183776, @scfc wrote: > Not complete: > > - `collector.py`: > -... [21:28:21] andrewbogott: I think it may be all instances in this project that have changed domains, I just checked staging-mc1 and staging-mc2 and they've got the same thing going on. [21:29:00] I think facter might be doing something wrong? [21:29:03] (or ‘right’) [21:30:31] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Review and productionize service manifest monitor - https://phabricator.wikimedia.org/T95210#1183911 (10yuvipanda) One of the differences from take is that it ignores symlinks, while I don't want to. [21:31:23] Nah, resolv.conf is set to use the extended fqdn on all these instances. [21:31:26] Facter is just reading that. [21:32:43] ok, so here’s what’s happening… [21:32:50] All instances have use_dnsmasq set in ldap, but not in hiera. [21:33:13] When instances switch over to using the local puppetmaster (staging-palladium) they suddenly think that use_dnsmasq is lowered. [21:33:41] YuviPanda: So it seems like these instances aren’t looking at ldap at all, for variables. [21:33:45] Is that possible? [21:33:52] ooooohhh [21:33:55] that’s actually possible [21:33:58] let me look [21:35:14] andrewbogott: actually no, the ENC is passing ldap variables on... [21:35:36] ok... [21:35:46] so that should be fine... [21:35:48] Why is use_dnsmasq lowered on these instances then? [21:36:13] what do you mean by ‘lowered’?
[21:36:23] unset [21:36:27] it’s set to ‘true’ in ldap [21:36:30] hmmm [21:36:40] but I can see from the state of resolv.conf that it’s not ‘true’ for puppet [21:37:03] let me pop on to staging-palladium and see what’s happening [21:37:19] thanks [21:38:09] andrewbogott: you can run /usr/local/bin/ldap-yaml-enc.py to see what’s the list of classes and params supplied to it [21:38:28] YuviPanda: run that where? on the instance? [21:38:34] andrewbogott: on staging-palladium [21:38:56] ec2id.eqiad.wmflabs [21:39:03] parameters: {instancename: staging-palladium, instanceproject: staging, realm: labs, [21:39:03] use_dnsmasq: 'true'} [21:39:04] andrewbogott: ^ [21:39:12] use_dnsmasq: true is being passed along [21:39:15] ‘true’ is not the same thing as true though [21:39:31] oh, is puppet expecting true rather than ‘true’? [21:39:40] LDAP only deals with strings.. [21:39:48] doesn’t do types. [21:40:06] … [21:40:39] or at least that’s what I thought / that’s what the python libraries docs seem to suggest? [21:40:39] I’m pretty sure that if I put use_dnsmasq=‘true’ in ldap it behaves differently from if I say use_dnsmasq=true [21:40:46] So, even though it’s a string in ldap, the puppet/ldap integration is handling it properly. [21:40:51] Which it seems the enc is not. [21:40:53] I see. [21:41:04] I’m just guessing, here, but that would explain the behavior. [21:41:19] I wonder if the puppet/ldap one is special casing things like that... [21:41:49] The template says <% if @use_dnsmasq == true then -%> [21:42:03] There’s probably some way in erb to specify set vs. unset rather than an explicit value... [21:42:07] does (‘true’ == true) == true? [21:42:11] but this would still be broken for other booleans. [21:42:17] Not in puppet, i don’t think so [21:42:23] I think that’s the right way to go for LDAP, honestly... [21:43:14] I kind of think that the enc should work properly :) Leaving code in there that can change a value is risky.
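[editorial note] The mismatch discussed above is that LDAP attribute values arrive as strings, so an ENC that passes them through unchanged hands Puppet the string 'true', which a comparison against the boolean true does not match. Below is a minimal Python sketch of one possible normalization step, assuming the ENC builds a parameters dict before emitting YAML. The names `coerce_bool` and `normalize_params` are hypothetical and this is not the actual code of ldap-yaml-enc.py.

```python
# Hypothetical helpers for illustration; not taken from ldap-yaml-enc.py.
TRUE_STRINGS = {"true"}
FALSE_STRINGS = {"false"}


def coerce_bool(value):
    """Return a real bool for the strings 'true'/'false', else the value unchanged."""
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in TRUE_STRINGS:
            return True
        if lowered in FALSE_STRINGS:
            return False
    return value


def normalize_params(params):
    """Coerce every parameter value; keys and non-boolean values pass through."""
    return {key: coerce_bool(val) for key, val in params.items()}


if __name__ == "__main__":
    # Parameter names mirror the ENC output quoted in the log.
    raw = {"instancename": "staging-palladium", "realm": "labs", "use_dnsmasq": "true"}
    print(normalize_params(raw))
```

The trade-off raised in the conversation still applies: with blanket coercion like this, a literal string 'true' can no longer be passed through, so a real fix might restrict coercion to a known list of boolean parameters.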
[21:43:27] I don’t know what happens for other resources, like present vs ‘present’ etc. [21:43:46] andrewbogott: yeah, but then you can never set a ‘string’ ‘true’... [21:43:53] andrewbogott: and in this case the ENC can just guess at values... [21:44:12] in ldap? You can set it to ‘true’ instead of true [21:44:13] if you need to [21:44:30] But use of booleans-in-strings is a known pitfall in puppet, we had a big purge a while ago [21:44:54] hmm [21:44:59] LDAP does seem to have types [21:45:00] http://www.zytrax.com/books/ldap/apa/types.html [21:45:10] but the library is giving me back only strings... [21:47:05] YuviPanda: what decides that that particular set of instances is using the enc? [21:47:15] andrewbogott: there’s a hiera setting for that [21:47:30] Huh, that’s very self-referential :) [21:47:35] Do we need to use it, in this case? [21:48:53] (03PS1) 10Greg Grossmeier: Remove Quality Assurance from -releng [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/202251 [21:49:38] YuviPanda: https://phabricator.wikimedia.org/T95240 [21:49:39] andrewbogott: for staging? yeah,. [21:50:12] 6Labs, 10Staging: Labs puppet ENC scrambles booleans - https://phabricator.wikimedia.org/T95240#1184023 (10yuvipanda) [21:55:14] YuviPanda, thcipriani, meanwhile I’ve set use_dnsmasq=true in the hiera setup for the project, that should get things back to normal… gradually :( [21:55:40] andrewbogott: so was this triggered automatically or is this simply because it is using the enc? [21:55:52] YuviPanda: I don’t understand the question [21:56:30] thcipriani: probably you can revive an instance by hand-editing puppet.conf, and possibly resolv.conf [21:58:03] andrewbogott: ok, noted, thanks for digging in here, appreciated. [21:58:36] thcipriani: if you’re not able to rescue the instances let me know and I’ll try to jot down a step-by-step. Sorry you got bit by this. [21:59:55] kk, no problem, as long as there's a way back. 
Also, staging should be designed to drop and rebuild instances without manual fiddling, as good a time as any to test that. [22:06:26] andrewbogott: as a workaround, we can also set it to true for all of labs on hiera atm... [22:06:31] and then I can look into the ENC tomorrow... [22:06:42] YuviPanda: well, I need it false in some cases [22:06:44] For people who are testing [22:06:49] andrewbogott: yeah, that can be overridden, no? [22:07:00] we just make ldap override hiera. [22:07:07] and local per-project hiera overrides labs-wide hiera... [22:07:10] (like https://gerrit.wikimedia.org/r/#/c/201942/) [22:07:24] Can the absence of a variable in ldap override the presence of one in hiera? [22:07:58] if $::use_dnsmasq == undefined [22:08:01] What I mean is, right now false and unset have the same meaning [22:08:02] to test for absence of it in LDAP [22:08:23] and hiera(‘use_dnsmasq’, $::use_dnsmasq) for hiera... [22:09:34] that would be in the puppet code? Or in the hiera settings? [22:10:40] andrewbogott: puppet code, yeah. [22:11:28] andrewbogott: basically, set it to whatever we want in hiera, and then we can have the ldap variable overwrite all of hiera, and then hiera will allow itself to be overridden more granularly [22:11:36] I can make a patch in an hour or so. heading out for a quick bite brb [22:11:43] ok [22:11:57] I think that sounds fine, although I’m still hoping that fixing the ENC will not be impossible :) [22:20:11] 6Labs, 10Tool-Labs, 7Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1184120 (10Pine) >>! In T90534#1180253, @yuvipanda wrote: > @Pine @scfc since this seems to be a common enough confusion, I've created a page at https://wikitech.wikimedia.org/wiki/Labs_labs_lab... [22:33:08] andrewbogott: I think I'm going to need a step-by-step.
For whatever reason, staging-palladium just undid your fix in a subsequent puppet run :( [22:55:59] andrewbogott: looks like wikitech:hiera gives the string "true" as well :\ https://phabricator.wikimedia.org/P482 [22:57:36] is this the appropriate place to report a problem with a tool on WMFLabs.org > [22:57:38] ? [22:58:15] thcipriani: in that case I can’t help you — yuvi will have to fix the ENC. [22:59:19] Dragonfly6-7: go ahead and just report it/say what's going on :) [22:59:21] 6Labs, 10Staging: Labs puppet ENC scrambles booleans - https://phabricator.wikimedia.org/T95240#1184203 (10Andrew) Apparently even if something is set as use_dnsmasq: true in the hiera file the ENC reads it as a string. [23:00:15] when the GLAMorous tool gives results that include wikidata, those results point to "wikidata.wikipedia.org" instead of "wikidata.org" [23:01:37] http://wikidata.wikipedia.org/w/index.php?title=Q3430896 , for instance [23:01:38] which is dead [23:01:47] er, which is a dead link [23:03:51] andrewbogott: thcipriani bah, am back... [23:04:04] that’s strange. the ENC has nothing to do with hiera [23:04:13] am taking a look now [23:05:10] YuviPanda: I also just tried adding it to /etc/puppet/hieradata/staging/common.yaml [23:05:16] same result [23:06:09] so that would mean the mwyaml, nuyaml backends as well as the enc don't support booleans [23:08:41] hmmm [23:08:43] chasemp: so phab_update_tag does a schema upgrade [23:08:52] they do, because we ran into this for something else, and me and _joe_ fixed it, I think... [23:08:57] and you want that to run instead of bin/storage upgrade [23:09:29] not where you put it no [23:09:35] but as a one-off on first install in labs only [23:09:36] sure [23:10:01] Negative24: ^ [23:10:16] ah I see [23:10:25] labs only and first boot [23:10:37] well, any other time where phd is stuck on something db wise [23:10:40] it would mean big problems [23:10:51] what case are you trying to solve? 
[23:11:21] the problem is that there is no way to check for phd db problems. I asked epriestley and he said just to run it every time hence what I did [23:11:26] thcipriani: andrewbogott aha! it’s not actually picking anything up from hiera at all. [23:11:36] you need to use hiera(‘name’), otherwise it doesn’t do anything [23:11:38] * YuviPanda makes patch [23:11:39] Negative24: where did you see the problem though? I don't understand [23:11:43] how you ended up trying to solve this problem [23:11:51] we don't ever suffer this in prod etc [23:11:59] it's on first run [23:12:07] ok so yes first run [23:12:22] the docs say to not worry about errors until after running phab_update_tag [23:12:24] for ours [23:12:27] puppet fails cause phd can't start [23:13:17] the practical problem is puppet is bad at this [23:13:24] any "one time" or "creates" solution [23:13:34] is not foolproof but could really do damage in prod [23:13:50] so the solution was to not worry about a hiccup on install and document it [23:14:01] but sans that, another standard in labs seems reasonable [23:14:47] I would use it to execute the script and create a lock file afterwards which then uses creates to check afterwards [23:15:06] yeah understood [23:15:14] but if that file goes missing accidentally [23:15:22] or [23:15:26] now you are running schema upgrades unexpectedly [23:15:29] which is like all down side [23:15:31] well that's it [23:15:44] puppet can't create lock files for first run only [23:15:54] puppet is difficult [23:16:03] yes puppet sucks at this in particular [23:16:06] one way [23:16:15] is to write a query or something and use 'unless' [23:16:19] but it gets murky [23:16:26] and in general the solution is worse than the problem [23:16:37] hrm [23:16:54] and are schema upgrades really, really bad?
[23:17:09] if there aren't any upgrades it just skips it [23:17:19] uh yes [23:17:22] i would trade that against "runs after second puppet run" anytime [23:18:05] it's not that they are bad, it's that they are irreversible [23:18:11] mutante: I'm not sure what you mean. [23:18:21] run upgrades every second run? [23:18:22] and a forced schema upgrade is just not worth the risk in prod [23:19:04] i mean it's worse to have unexpected schema upgrades and that it needs more than one puppet run is not that big of a deal [23:19:07] for an only at first install and never again extra step [23:19:26] yeah I'm with mutante on this one [23:19:40] but if it's hard for labs I can see it being worked out here [23:20:03] * Dragonfly6-7 waves [23:20:40] i'll see what i can work out of puppet [23:20:46] Dragonfly6-7: hi! I think your best bet is to find the maintainers of the tool and report it to them. so unless they are already here, I don’t think this is an appropriate place... [23:20:54] this only deals with the underlying infrastructure the tools run on [23:21:21] for now we are sticking to a labs only scope [23:23:17] mutante: but let's get the phab upgrade in labs patch working [23:24:03] chasemp: have you had a chance to look at https://gerrit.wikimedia.org/r/#/c/201857/ [23:24:45] basically i don't know about the spring_tag [23:24:51] sprint [23:25:24] Negative24: not yet but if that is to match prod [23:25:31] can be merged sure assuming main tag matches prod [23:25:32] that is [23:25:41] YuviPanda - I don't even know how to look at a tool and ascertain who runs it [23:25:44] I do know for a fact that with phab on 2015-02-18/1 and sprint on 2015-01-08 it will break [23:25:59] one of my first mistakes [23:26:14] Dragonfly6-7: 1. go to tools.wmflabs.org, 2. look for the tool name, 3. maintainers are listed there [23:27:50] chasemp mutante: and it's not like a sprint upgrade is that risky. I've had it working on phab-02 since the beginning.
[23:28:50] ah, Magnus Manske [23:29:00] does he come on IRC? [23:31:37] Dragonfly6-7: nope [23:31:49] Dragonfly6-7: you could leave him a message on enwiki / dewiki [23:31:49] i guess [23:32:51] oh well [23:32:52] thank you [23:33:05] chasemp: https://gerrit.wikimedia.org/r/#/c/201857/4 [23:34:55] done [23:35:08] thanks [23:35:29] agh, he says that for problems with his lab tools, to use bitbucket [23:35:38] I'm looking at bitbucket [23:37:11] let's see if I can figure this out [23:40:25] i think all tool maintainers should use standard, accessible channels of communication/bug reporting (wiki, phabricator, email) :\ [23:47:18] YuviPanda: nginx, maybe? [23:47:39] you know, I’m still quite very sure it isn’t sendfile causing your issue :) [23:47:40] but [23:47:41] let me do it [23:47:43] anyway [23:47:46] and see what happens, yeah [23:49:05] Negative24: alright, sendfile turned off [23:49:06] try? [23:49:29] * Negative24 is testing [23:50:09] YuviPanda: https://phab-02.wmflabs.org/ [23:50:22] * Negative24 is not going to say told you so [23:50:35] haha [23:50:43] you totally can btw :) [23:50:47] so [23:50:52] can we keep? [23:50:55] are you sure it’s the sendfile on and not the fact that nginx was restarted? [23:51:14] I’m turning it back on to see if that was the case. [23:51:20] go ahead [23:51:53] Negative24: ok, it’s back on now [23:51:57] and still works... [23:52:12] that may not reproduce because it needs to cache a old css [23:52:22] now its updated with the new one [23:52:36] anyway, thank you [23:52:46] did changing that cause a nginx restart [23:53:03] Negative24: yeah, I hand-changed it and did a restart [23:53:14] hrm [23:53:23] I’m still unsure how this is nginx related, though. we have no caching directives at all [23:53:38] well keep it on and when it happens again I'll ping you to try [23:53:41] or [23:53:50] well that wouldn't get us anywhere [23:53:51] yeah, totally. [23:53:58] well, it works now.. 
[23:54:06] when was the last time you tested it? [23:54:18] I've been refreshing all day [23:54:23] hmm [23:54:33] so it’s strange. quite. [23:54:37] keep sendfile on and when it happens again just restart it [23:54:40] did you do a curl on localhost? [23:54:45] yep [23:54:47] yeah, and see if that affects it [23:54:56] Negative24: what’s the instance name called? [23:55:07] just to narrow it down to the restart or sendfile [23:55:12] phab-02 [23:55:48] right [23:57:25] YuviPanda: Was there an NFS issue of some kind? Just ran into https://phabricator.wikimedia.org/T62862#1167560 again with a dozen duplicate processes [23:58:22] Krinkle: hmm, no NFS issues, but - are you using bigbrother? [23:58:25] maybe that has a bug... [23:58:36] YuviPanda: I gave up on using bigbrother for other than webservice. [23:58:40] Only gave me email spam [23:58:54] Maybe I used the wrong format [23:59:08] :) it’s a fairly terrible system. replacement coming! [23:59:09] anyway [23:59:20] Krinkle: I’m not sure, no. it’s possible that the locking -once uses is just not robust enough [23:59:54] 6Labs, 6Phabricator: Phab-02 sending old stylesheet copies - https://phabricator.wikimedia.org/T94413#1184368 (10Negative24) a:3Negative24 @YuviPanda turned off sendfile (which also restarted nginx) and that seems to have fixed the problem but we don't know whether the restart or sendfile fixed it. I'm going... [23:59:58] YuviPanda: */2 * * * * /usr/bin/jsub -N ecmabot-wm -once -continuous -quiet -stderr -mem 1700M node ~/apps/oftn-bot/wm-ecmabot.js > /dev/null 2>&1
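[editorial note] On the duplicate-process report above: one common way for a cron-launched job to guard against duplicates on its own side is an advisory file lock taken with flock(2). The following is a minimal Python sketch under that assumption. It does not describe how `jsub -once` is implemented, the helper name `try_acquire_lock` and any lock path are hypothetical, and advisory locks may behave differently on network filesystems.

```python
import fcntl
import os


def try_acquire_lock(path):
    """Return an open fd holding an exclusive lock on path, or None if it is already locked."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A job would call this at startup, exit quietly when it returns None, and keep the returned descriptor open for its whole lifetime; the kernel drops the lock when the process exits, so a crashed job does not leave a stale lock behind.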