[00:00:30] kaldari, well, Cyberbot won't use it for memory purposes. It already has a comprehensive memory management system for crashes and restarts
[00:01:10] But for avoiding redundant tasks it will help. That's assuming the bots logging there do the same job as Cyberbot
[00:01:39] I don't think it's unreasonable to say that of all the bots, Cyberbot is probably the most advanced
[00:02:03] And it keeps getting more advanced
[00:03:26] kaldari, Cyberbot's newest code should not only look nicer, but offer some performance benefits.
[00:03:36] cool
[00:04:01] kaldari, but I have yet to test the performance enhancing code
[00:04:17] I'm about to do that.
[00:23:44] Labs, MediaWiki-extensions-OpenStackManager, wikitech.wikimedia.org, Patch-For-Review, WMF-deploy-2016-02-09_(1.27.0-wmf.13): Wikitech often loses track of internal openstack/nova session - https://phabricator.wikimedia.org/T101199#1992488 (Legoktm) Wooooooooooo!
[00:38:49] Change on www.mediawiki.org a page Wikimedia Labs/Authentication improvement project was modified, changed by 75.146.81.242 link https://www.mediawiki.org/w/index.php?diff=2038126 edit summary:
[00:39:52] Change on www.mediawiki.org a page Wikimedia Labs/Authentication improvement project was modified, changed by Tegel link https://www.mediawiki.org/w/index.php?diff=2038127 edit summary: Reverted edits by [[Special:Contributions/75.146.81.242|75.146.81.242]] ([[User talk:75.146.81.242|talk]]) to last revision by [[User:Ryan lane|Ryan lane]]
[00:43:26] Labs, Tool-Labs, Security-Team: consider making individual tools on tool labs have their own X.tools.wmflabs.org subdomain - https://phabricator.wikimedia.org/T125589#1992538 (scfc) I think many (most?) tools are not coded in a way that they can be served from `https://$tool.tools.wmflabs.org/` and `ht...
[01:07:31] kaldari, ping again
[01:19:47] Cyberpower678: hello
[01:20:11] kaldari, https://github.com/cyberpower678/Cyberbot_II/tree/test-code
[01:20:19] some experimental code
[01:21:11] kaldari, it's pretty stable but still has some minor bugs to work out.
[01:21:29] If you want to look at it already though, you're free to do so.
[01:25:38] (PS5) Tim Landscheidt: Add list-user-databases command [labs/toollabs] - https://gerrit.wikimedia.org/r/234934 (https://phabricator.wikimedia.org/T91231)
[01:32:11] Luke081515|away: matt_flaschen I've restarted the celery workers and things are back online. You'll need to hit submit on the stuck queries again though, sorry!
[01:33:22] YuviPanda, okay, thanks.
[01:33:42] YuviPanda, don't worry, it's supposed to fail.
[01:33:47] :) ok!
[01:33:51] http://quarry.wmflabs.org/query/7187 :)
[01:33:54] is this you testing x1, I suppose?
[01:34:03] nice
[01:34:05] cool
[01:34:31] YuviPanda, that one's not on x1, actually, which is why I picked it.
[01:34:36] ah
[01:34:39] cool
[01:52:00] Labs, Tool-Labs, pywikibot-core: Tool Labs: shared Pywikibot code not available - https://phabricator.wikimedia.org/T125505#1992690 (scfc) Open>Resolved a:scfc I ran `cat /shared/pywikipedia/core/pwb.py > /dev/null` on all instances, and it succeeded on all bastions and execution nodes.
[01:52:23] Tool-Labs-tools-Other, Possible-Tech-Projects: Fix TreeViews to provide pageviews statistics for all articles of any wikiproject etc. - https://phabricator.wikimedia.org/T56184#1992696 (mobrovac)
[03:00:20] !log tools upgraded flannel on all hosts running it
[03:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[03:31:34] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:31:39] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:31:47] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:31:47] PROBLEM - Puppet failure on tools-web-static-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:32:01] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:32:03] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:32:03] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:32:05] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:32:15] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:32:15] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:32:44] How many more bots in this channel will spam and annoy me.
[03:32:51] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:32:59] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:33:26] PROBLEM - Free space - all mounts on tools-packages is CRITICAL: CRITICAL: tools.tools-packages.diskspace.root.byte_percentfree (<100.00%)
[03:33:26] PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:33:27] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:33:38] PROBLEM - Puppet failure on tools-exec-1217 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:34:04] PROBLEM - Puppet failure on tools-cron-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:34:18] PROBLEM - Puppet failure on tools-precise-dev is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:34:20] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:34:23] And another one added to my ignore list
[03:34:32] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:34:33] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:34:33] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:34:33] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:34:59] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:35:05] PROBLEM - Puppet staleness on tools-submit is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[03:35:15] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:35:29] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:35:43] PROBLEM - SSH on tools-exec-1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:35:43] PROBLEM - Puppet failure on tools-web-static-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:35:57] PROBLEM - Puppet failure on tools-exec-1214 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:35:57] PROBLEM - Puppet failure on tools-exec-1221 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:36:09] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:36:18] Labs, Tool-Labs, Patch-For-Review, Shinken: Lots of hosts' services missing from Shinken - https://phabricator.wikimedia.org/T123271#1992770 (scfc) Open>Resolved
[03:36:23] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:39:20] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:39:20] PROBLEM - Puppet failure on tools-exec-1220 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:39:20] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:39:20] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:39:24] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:42:23] Labs, Tool-Labs: Setup DNS for kubernetes services - https://phabricator.wikimedia.org/T111914#1992781 (yuvipanda) a:yuvipanda
[03:42:46] Labs, Tool-Labs: Setup DNS for kubernetes services - https://phabricator.wikimedia.org/T111914#1619768 (yuvipanda) We need to somehow mount the ca cert from the host to the pods and it'll be all good.
[03:53:15] Labs, Tool-Labs: Setup DNS for kubernetes services - https://phabricator.wikimedia.org/T111914#1992787 (yuvipanda) `x509: cannot validate certificate for 192.168.0.1 because it doesn't contain any IP SANs` because kube2sky attempts to contact kubernetes via the IP that's made available via the environmen...
[03:54:41] Labs, Labs-Infrastructure, Tool-Labs, MediaWiki-extensions-SemanticForms, Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1992790 (Yaron_Koren)
[04:05:23] PROBLEM - Puppet failure on tools-docker-registry-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[04:08:39] Labs-Other-Projects: Creating new messages via e-mail - https://phabricator.wikimedia.org/T125098#1992793 (EBernhardson) New messages created via email has been put together. The current email is a bit ugly, for example new topics in the wikimedia-l category are created by emailing `wmflabsdiscourse+wikimedi...
[04:08:55] Labs-Other-Projects: Creating new messages via e-mail - https://phabricator.wikimedia.org/T125098#1992794 (EBernhardson) Open>Resolved
[04:08:57] Labs-Other-Projects: Succesful pilot of Discourse on https://discourse.wmflabs.org/ as an alternative to wikimedia-l mailinglist - https://phabricator.wikimedia.org/T124690#1992796 (EBernhardson)
[04:09:28] Labs-Other-Projects: Creating new messages via e-mail - https://phabricator.wikimedia.org/T125098#1974344 (EBernhardson)
[04:09:30] Labs-Other-Projects: Set up reply via email support - https://phabricator.wikimedia.org/T125099#1992798 (EBernhardson) Open>Resolved a:EBernhardson This has been setup. All outgoing email messages now encourage the user to reply to them. When they reply that message is added to the appropriate topic.
[05:16:43] Labs-Other-Projects: Succesful pilot of Discourse on https://discourse.wmflabs.org/ as an alternative to wikimedia-l mailinglist - https://phabricator.wikimedia.org/T124690#1992851 (EBernhardson)
[05:16:45] Labs-Other-Projects: Assure secure communication on discourse.wmflabs.org https:// SSL - https://phabricator.wikimedia.org/T124829#1992848 (EBernhardson) Open>Resolved a:EBernhardson Applied changes suggested at https://wiki.mozilla.org/Community_Ops/Discourse/Setup#Edit_web.template.yml_.28Only_f...
[05:23:55] Labs-Other-Projects: Configure Single Sign On at discourse.wmflabs.org - https://phabricator.wikimedia.org/T124691#1992853 (EBernhardson) I took a look over this and none of the options available currently can be done without several days of engineering effort. As @tgr mentioned above, oauth1 is unsupported...
[05:25:53] Labs-Other-Projects: Problem creating an account at https://discourse.wmflabs.org/ - https://phabricator.wikimedia.org/T125107#1992856 (EBernhardson) It looks like SSO won't just be some configuration, it will require some engineering effort. I can look into what caused the failure to create an account, but n...
[05:30:44] Labs-Other-Projects: Configure Single Sign On at discourse.wmflabs.org - https://phabricator.wikimedia.org/T124691#1992858 (yuvipanda) One alternative that removes possible political obstacles is to write a simple intermediary that implements discourse's SSO, *and* MW OAuth login...
[06:03:41] valhallasw`cloud: I took your prefix search form idea and ran with it on the Tool Labs and Labs portal page and Help:Tool Labs. Give it a look and tweak it as needed.
[06:14:24] Labs, Beta-Cluster-Infrastructure: Disable /data/project for instances in deployment-prep that do not need it - https://phabricator.wikimedia.org/T125624#1992944 (yuvipanda) NEW
[07:00:01] PROBLEM - Puppet failure on tools-exec-1404 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[07:17:53] Labs, Beta-Cluster-Infrastructure: Disable /data/project for instances in deployment-prep that do not need it - https://phabricator.wikimedia.org/T125624#1993022 (yuvipanda) So: # deployment-mediawiki01/02 # deployment-tmh01 # deployment-upload # deployment-cache-upload04 These seem to be the only ones...
[07:18:05] Labs, Beta-Cluster-Infrastructure: Disable /data/project for instances in deployment-prep that do not need it - https://phabricator.wikimedia.org/T125624#1993023 (yuvipanda) a:yuvipanda
[07:27:42] Labs-Other-Projects: Configure Single Sign On at discourse.wmflabs.org - https://phabricator.wikimedia.org/T124691#1993035 (EBernhardson) took a further poke around, could probably hack up https://tools.wmflabs.org/oauth-hello-world/index.php?action=download to be an intermediary between oauth1 and discourse...
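Editorial sketch: the "simple intermediary that implements discourse's SSO" floated at 05:30:44 and 07:27:42 might look roughly like the Python below. It assumes the Discourse SSO exchange is a base64-encoded query string signed with HMAC-SHA256 under a shared secret, which is not spelled out anywhere in this log. `SSO_SECRET`, the helper names, and the user fields are hypothetical, and the MW OAuth login step that would sit between request and response is left out entirely.

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode

# Hypothetical shared secret; a real deployment would load it from config.
SSO_SECRET = b"example-secret"


def sign(payload: str) -> str:
    """Return the hex HMAC-SHA256 signature of a base64 payload string."""
    return hmac.new(SSO_SECRET, payload.encode("ascii"), hashlib.sha256).hexdigest()


def verify_request(sso: str, sig: str) -> dict:
    """Check the signature on an (sso, sig) pair and decode its fields."""
    if not hmac.compare_digest(sign(sso), sig):
        raise ValueError("bad signature")
    decoded = base64.b64decode(sso).decode("utf-8")
    return {key: values[0] for key, values in parse_qs(decoded).items()}


def build_response(nonce: str, user: dict) -> dict:
    """Build the signed payload to hand back after the user has logged in.

    The caller (assumed here to be the MW OAuth side) supplies the user fields.
    """
    fields = {"nonce": nonce, **user}
    payload = base64.b64encode(urlencode(fields).encode("utf-8")).decode("ascii")
    return {"sso": payload, "sig": sign(payload)}
```

The intermediary would call `verify_request` on the incoming parameters, send the user through the wiki login, then redirect back with the output of `build_response`, echoing the nonce so the reply can be matched to the original request.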
[07:33:25] Oh https://tools.wmflabs.org/glamtools/proxy.php?url=http%3A%2F%2Fstats.grok.se%2Fjson%2Fit%2F201601%2FTre%2520leggi%2520della%2520robotica
[07:35:08] RECOVERY - Puppet failure on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:51:33] Labs-Other-Projects: Configure Single Sign On at discourse.wmflabs.org - https://phabricator.wikimedia.org/T124691#1993064 (AdHuikeshoven) Thanks for looking into this. I understand it is several days of work no matter which option is chosen. To get this feature some funding or sponsoring is necessary or som...
[09:45:05] Labs, DBA, operations, Patch-For-Review: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1993287 (jcrespo) Related: T119154
[12:28:29] c
[16:06:15] YuviPanda: seems you should be able to help me get an IRCCloud account... is that true ?
[16:07:27] Labs, Labs-Infrastructure, Phabricator: can't log in to phab-01.eqiad.wmflabs - https://phabricator.wikimedia.org/T125666#1994127 (mmodell) NEW
[16:08:35] gehel: you can just register on irccloud.com?
[16:09:26] valhallasw`cloud: I probably can, but my onboarding doc says we have specific accounts for WMF ...
[16:09:57] valhallasw`cloud: and YuviPanda seems to be one of the contacts for that (so says the doc...)
[16:10:15] gehel: ah, right, I remember something about paid accounts for irccloud.
[16:21:36] gehel: contact OIT for a license I think
[16:24:25] chasemp: via email to "techsupport@wm.o"? (I know, I'm really new ...)
[16:25:52] iirc it's support@wmf.o? usually opsen do their first week in SF to get settled so I can't remember
[16:26:00] asking in -staff is probably a decent idea
[16:26:24] let's try -staff...
[16:26:25] thx
[16:57:47] PROBLEM - SSH on tools-worker-1002 is CRITICAL: Server answer
[17:01:19] Labs, Tool-Labs: tools-packages triggers "free space" Shinken warning - https://phabricator.wikimedia.org/T125675#1994368 (scfc) NEW
[17:07:48] RECOVERY - SSH on tools-worker-1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[17:13:46] PROBLEM - SSH on tools-worker-1002 is CRITICAL: Server answer
[17:15:41] Labs, Tool-Labs: tools-packages triggers "free space" Shinken warning - https://phabricator.wikimedia.org/T125675#1994515 (valhallasw) We can safely remove tools-packages, as the actual build files are all on NFS (/data/project/dpkg). This sounds like it should be a general problem for all hosts with the...
[17:29:58] Labs, Tool-Labs: Chinese scraper (?) with multiple IP addresses overloading wsexport - https://phabricator.wikimedia.org/T122582#1994539 (valhallasw) Open>Resolved a:valhallasw
[17:33:46] RECOVERY - SSH on tools-worker-1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[17:34:09] Labs, Tool-Labs, Tool-Labs-tools-Other: `tools.anon` causing excessive I/O - https://phabricator.wikimedia.org/T125349#1994543 (valhallasw) Open>Resolved a:valhallasw No problem, these things happen.
[17:42:28] Labs, Tool-Labs, Security-Team: consider making individual tools on tool labs have their own X.tools.wmflabs.org subdomain - https://phabricator.wikimedia.org/T125589#1994579 (valhallasw) We could potentially: * redirect `//tools.wmflabs.org/$tool/$path` to `//$tool.tools.wmflabs.org/$tool/$path`, *...
[18:13:44] Labs, Tool-Labs: tools-packages triggers "free space" Shinken warning - https://phabricator.wikimedia.org/T125675#1994751 (scfc) As IIRC the root partition always is ~ 20 GByte, yes, that would probably mean manual intervention every time an instance with the `package_builder` class is launched. If we wa...
[18:26:58] Labs, Tool-Labs: tools-packages triggers "free space" Shinken warning - https://phabricator.wikimedia.org/T125675#1994847 (valhallasw) The reason it's a separate host is because the pbuilder manifest was created for jessie, and I decided it wasn't worth my time to get it working (and keep it working) on t...
[18:46:24] Labs, Tool-Labs, Security-Team: consider making individual tools on tool labs have their own X.tools.wmflabs.org subdomain - https://phabricator.wikimedia.org/T125589#1994903 (scfc) I do remember thinking about this last night/this morning and finding another issue that needed to be addressed, but I fo...
[18:53:32] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#1994930 (Pine) Per discussion on the Analytics mailing list, would it be possible for the students to publish their Project Planning Document on Commons under...
[18:53:43] PROBLEM - Host tools-packages is DOWN: CRITICAL - Host Unreachable (10.68.16.123)
[18:56:31] Labs, Tool-Labs: tools-packages triggers "free space" Shinken warning - https://phabricator.wikimedia.org/T125675#1994939 (scfc) Open>Resolved a:scfc I deleted `tools-packages`.
[19:18:18] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#1995046 (JEumerus) If it's text, might Wikisource work as well? According to https://commons.wikimedia.org/wiki/Commons:Project_scope, text should be used in C...
[19:20:13] Labs, wikitech.wikimedia.org: some image files not getting updated properly on wikitech-static - https://phabricator.wikimedia.org/T125695#1995061 (Andrew) NEW
[19:31:22] Labs-Other-Projects: Can't SSH to snuggle-en.eqiad.wmflabs - https://phabricator.wikimedia.org/T125342#1995115 (Halfak) Had @yuvipanda take a look and he found that there was a stale puppet lock file that was preventing puppet from running. He deleted that file and puppet ran just fine. This solved the is...
[19:31:28] Labs-Other-Projects: Can't SSH to snuggle-en.eqiad.wmflabs - https://phabricator.wikimedia.org/T125342#1995116 (Halfak) Open>Resolved
[19:31:41] valhallasw`cloud: https://gerrit.wikimedia.org/r/#/c/267402/ I'm going to try to setup paws.wmflabs.org now (eventually paws.tools.wmflabs.org)
[19:31:56] partially because jupyterhub isn't fully secure to run in a shared domain
[19:32:24] YuviPanda: what is @web_domain?
[19:32:33] valhallasw`cloud: it was tools.wmflabs.org
[19:32:37] I'm not fully sure why it exists
[19:32:45] we can probably replace it with $host in all the places
[19:32:46] and shouldn't you just remove the whole Host: header setting?
[19:32:53] seems redundant if it's the same as what the browser sends
[19:32:59] I don't think nginx passes that through
[19:33:00] by default
[19:33:15] I think it sets it to tools-webgrid-whatever
[19:33:24] if you don't set it explicitly
[19:33:55] https://www.irccloud.com/pastebin/DoUrQmI5/
[19:34:05] so yes, you're right
[19:34:30] the default is 'pass everything, override Host and Connection'
[19:34:55] yeah
[19:37:14] andrewbogott: ha saw your email, we actually do need to cause an outage here, moritz asked me how we can update the kernel on labstore
[19:37:16] same issue
[19:37:23] YuviPanda: ^
[19:37:34] which email?
[19:37:37] the wikitech downtime one?
[19:37:45] Moritz seems ok with letting the outage-causing boxes slide for now
[19:38:02] well in the case I would love waiting on that box
[19:38:06] labstore1001
[19:38:23] I’m pretty sure that boxes that are ops-logins-only are not especially pressing.
[19:38:26] labstore isn't on the list right
[19:38:28] I'm not at all confident in our ability to go over to labstore1002 for awhile and even if we could it would be riskier than just dealing w/ the downtime
[19:38:41] it's not on the list, I'm saying moritz asked me directly like...yesterday?
[19:38:44] ah
[19:38:50] I don't think we should do the reboot now, yeah
[19:39:05] the list that I’m planning to reboot are only spares, except for silver
[19:39:18] (which is more of a security concern obv.)
[19:40:15] chasemp: so I think we can postpone it for a month or something maybe. Until we're happy doing a reboot
[19:40:17] ok, sounds good to me, wanted to get it on the group radar
[19:40:19] Anyway, when I told moritz “I don’t want to reboot these, people will hate that” he didn’t push back. So probably the same goes for labstore.
[19:40:27] ok
[19:40:36] I'm on board
[19:41:12] YuviPanda: are you not at the beach right now?
[19:41:30] andrewbogott: no, i'm at the research offsite
[19:41:43] (which I thought was at the beach, but maybe that’s just the after-party)
[19:42:11] andrewbogott: it's the beach tonight but I'm not going
[19:43:02] andrewbogott: chasemp one thing we can do is perhaps just update the kernel package but not restart
[19:43:29] my dislike with that is later when it restarts for whatever reason I've forgotten it's a new kernel
[19:43:39] and whatever weirdness pops up is more mysterious and probably 2 am
[19:44:01] fair enough
[19:50:37] chasemp: btw, I might be able to kick off another 20-30 instances off NFS today/tomorrow
[19:50:38] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#1995231 (DannyH)
[19:50:49] good deal, from where?
[19:50:50] https://phabricator.wikimedia.org/T125624
[19:52:56] chasemp: not heavy users so
[19:53:08] probably not but every bit helps
[19:53:30] yeah
[19:53:32] +1
[19:53:39] I have some stuff to persist to ticket in this realm and it's just always one more afternoon away for like 3 days
[19:53:44] :D
[19:53:58] I'm going to try de-NFS some more things
[19:54:30] here is an interesting thing I figured out monday or so
[19:54:44] so nfsv4 does not need mountd etc as per the spec and so redhat and everyone says
[19:54:54] ah
[19:54:56] but that's just rfc the reality is it still uses it for sys auth
[19:55:11] but really that's it and that's where we do our preload hack with ldap
[19:55:21] hmm
[19:55:28] well turns out the native mechanism w/ nfsd caches lookups in proc
[19:55:41] so we do have some small caching that is roughly hard coded
[19:55:48] but every time we run exportfs it wipes out the cache
[19:55:52] which we do every few minutes now
[19:55:54] ah
[19:55:56] yes
[19:55:58] we do...
[19:56:01] so I want to convert that to run only on notify
[19:56:05] * YuviPanda nods
[19:56:08] I can't come up with a good reason not to
[19:56:11] that's the right thing to do anyway
[19:56:13] and that was a big reason to
[19:56:13] yeah
[19:56:41] we can do it the easy hacky way where we don't run exportfs unless there's a change, or the 'right' way which would involve actual notification on instance creation / deletion
[19:56:55] so on top of not caching our ldap lookups globally on the machine we wipe out our limited cache very often that nfsd uses for //every operation//
[19:57:03] I guess we also don't need to run exportfs -a
[19:57:13] YuviPanda: for the moment I was just going to go w/ notify on yaml file change with puppet
[19:57:30] but the events mechanism is preferable it's just bigger...?
[19:57:32] chasemp: that's not enough since it needs to run when new instances are added
[19:57:38] not just new projects
[19:57:55] blargh right
[19:58:00] yeah
[19:58:09] so what's reasonable to do there?
[19:58:22] so write the code that checks the file outputs
[19:58:32] and see if that's changed
[19:58:37] and run exportfs only for changed exports
[19:59:28] there's a patch to do that for shinkengen
[19:59:36] that's pretty achievable
[19:59:47] yeah
[19:59:53] and that's the right thing to do I guess
[20:00:00] and that'll also allow us to target mounts with exportfs
[20:00:02] rather than just -a
[20:00:48] it may be irrelevant from this standpoint afaict to be safe exportfs wipes out all auth / lookup caching for the rpc stuff
[20:00:56] I'm sure that was easiest for dev sanity
[20:01:02] yeah
[20:01:04] makes sense
[20:01:10] would be easier for us too
[20:01:32] I'm going to afk for lunch. I can take that on if you want (or you can do it too :D)
[20:03:00] brb
[20:15:33] Labs, wikitech.wikimedia.org: some image files not getting updated properly on wikitech-static - https://phabricator.wikimedia.org/T125695#1995312 (Krenair) Actually if you look at what the normal wikitech is doing, you see it loads those images from commons (the one image you do get is a local file and i...
[20:15:52] Labs, wikitech.wikimedia.org: InstantCommons on wikitech-static - https://phabricator.wikimedia.org/T125695#1995313 (Krenair)
[20:17:18] Labs, Beta-Cluster-Infrastructure: Disable /data/project for instances in deployment-prep that do not need it - https://phabricator.wikimedia.org/T125624#1995318 (demon) Logging server? I think we write those syslogs to /data/project
[20:33:57] Labs, wikitech.wikimedia.org: Have a process for regularly updating wikitech-static - https://phabricator.wikimedia.org/T125709#1995382 (ori) NEW a:Andrew
[21:24:06] Labs, Beta-Cluster-Infrastructure: Disable /data/project for instances in deployment-prep that do not need it - https://phabricator.wikimedia.org/T125624#1995517 (yuvipanda) I am pretty sure we killed that when we found out.
[21:34:24] Labs, wikitech.wikimedia.org: InstantCommons on wikitech-static - https://phabricator.wikimedia.org/T125695#1995558 (Andrew) I'd prefer not to update instant commons on wikitech-static. Wikitech hosts network diagrams that are potentially useful during an outage, and I don't want those diagrams to vanish...
[21:53:54] Labs, wikitech.wikimedia.org: InstantCommons on wikitech-static - https://phabricator.wikimedia.org/T125695#1995673 (Krenair) >>! In T125695#1995558, @Andrew wrote: > I'd prefer not to update instant commons on wikitech-static. Wikitech hosts network diagrams that are potentially useful during an outage,...
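Editorial sketch: the "easy hacky way" from the 19:56 to 20:00 exchange (skip exportfs unless the generated exports actually changed, so the nfsd lookup cache is not wiped every few minutes) could be approximated as below. The function name, the file path argument, and the injectable `run` hook are hypothetical; `exportfs -ra` is used for the re-export step, which still refreshes everything rather than targeting only the changed mounts.

```python
import subprocess
from pathlib import Path


def sync_exports(new_text: str, exports_file: Path, run=subprocess.check_call) -> bool:
    """Rewrite the exports file and re-export only when its content changed.

    Returns True if exportfs was invoked, False if nothing changed.
    `run` is injectable so the logic can be exercised without root or NFS.
    """
    old_text = exports_file.read_text() if exports_file.exists() else ""
    if old_text == new_text:
        # No change: leave the running exports (and nfsd's caches) alone.
        return False
    exports_file.write_text(new_text)
    # Assumption: a full re-export; per-mount targeting is left out of this sketch.
    run(["exportfs", "-ra"])
    return True
```

A periodic job could call this with freshly generated export text on every run; only runs that add or remove an instance would reach the exportfs call.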
[22:19:36] MediaWiki-extensions-OpenStackManager, Notifications, Collaboration-Team-Current: Update OpenStackManager notifications to new language and format - https://phabricator.wikimedia.org/T125691#1995807 (Mattflaschen)
[22:37:48] PROBLEM - SSH on tools-worker-1002 is CRITICAL: Server answer
[23:41:26] Tool-Labs-tools-Other: SVG Translate does not accept HTTPS URLs - https://phabricator.wikimedia.org/T125743#1996293 (Krenair)
[23:45:01] Labs, Labs-Infrastructure, Tool-Labs, operations: failed backups on labstore? - https://phabricator.wikimedia.org/T125749#1996302 (Dzahn) NEW
[23:45:40] Labs, Labs-Infrastructure, Tool-Labs, operations: failed backups on labstore? - https://phabricator.wikimedia.org/T125749#1996310 (Dzahn) since it says "was exit-code" it looks more like a typo in the monitoring script ?
[23:50:28] Labs, Labs-Infrastructure, Tool-Labs, operations: failed backups on labstore? - https://phabricator.wikimedia.org/T125749#1996327 (Dzahn) on neon i can see the check commands used: ``` @neon:/etc/icinga# grep check_replicate puppet_services.cfg check_command nrpe_check!check_re...