[00:00:20] what is the class?
[00:00:54] role::releases::upload
[00:01:06] -> role::microsites::releases::upload
[00:01:22] it's included in the role deployment::server
[00:01:28] that is on tin and mira
[00:01:38] it lets people upload stuff _from_ there to releases.wm
[00:01:59] so it would affect deployment-beta-xxx
[00:02:16] the equivalents of tin/mira
[00:02:25] deployment-tin and mira
[00:02:41] (which should be nuked and rebuilt with the proper name :/)
[00:02:41] ok, thanks, i will go and configure them correctly
[00:02:50] by clicking in wikitech ui..right
[00:03:03] do we really have releases::upload in beta?
[00:03:18] probably not
[00:03:27] you can check ldap for these
[00:03:36] the watroles tool?
[00:03:55] mutante: is it already included in another role? I don't remember ever applying a role with a name like that manually
[00:03:57] simple ldapsearch
[00:04:18] bd808: yes, it's included in role deployment::server
[00:04:39] it should just magically work then
[00:04:43] if it's applied
[00:04:55] yes, it's fixed in prod 2 minutes ago
[00:05:05] ok, great
[00:05:17] * bd808 looks at deployment-tin config just for fun
[00:06:08] it has role::deployment::server applied
[00:06:40] okay, thanks
[00:07:09] e.g. ldapsearch -x puppetClass=role::cache::upload
[00:07:11] all of this is still to move manifests/role/ to modules/role/manifests/
[00:07:20] and i combined a couple of the tiny apache sites
[00:07:23] into that common module
[00:07:25] that's why
[00:19:15] RECOVERY - Puppet run on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:02:27] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:07:49] !log ores testing puppet on all instances before merging change that moves role classes to module
[01:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master
[01:16:59] !log ores merged https://gerrit.wikimedia.org/r/#/c/270102/ - role classes have been moved to modules/role and split into one file per class
[01:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master
[01:17:15] !log ores but no class names have changed and confirmed no-op on every single instance
[01:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master
[01:17:39] !log ores while doing that i noticed that ores-web-02 has a "Cannot allocate memory" problem
[01:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master
[01:18:19] halfak: ^
[01:18:27] Cannot allocate memory - fork(2)
[01:18:27] root@ores-web-02:~#
[01:23:01] 6Labs, 10Labs-Infrastructure, 10ores: ores-web-02 - cannot allocate memory - https://phabricator.wikimedia.org/T130338#2133162 (10Dzahn)
[02:24:34] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:25:34] RECOVERY - Puppet run on tools-redis-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:42:07] RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:30:09] 6Labs, 10DBA, 6Operations: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2133756 (10jcrespo) a:5Cmjohnson>3jcrespo
[14:09:51] 10Labs-Other-Projects: Problem creating an account at https://discourse.wmflabs.org/ - https://phabricator.wikimedia.org/T125107#1974477 (10Rillke) I like how [[ https://commonsarchive.org | many ]] [[ https://quarry.wmflabs.org/ | tools ]] use OAuth for login. It just takes me a click and I am logged-in.
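For context on the ldapsearch suggestion at [00:03:57] and [00:07:09]: each Labs instance's applied puppet classes are exposed as puppetClass attributes in LDAP, so a plain query answers "is this role applied anywhere in beta?". A minimal sketch of such a query; the LDAP URI and base DN below are assumptions for illustration, not values taken from the log:

```
# list instances that have a given puppet role applied (simple bind);
# ldap-labs.example.org and the base DN are placeholders
ldapsearch -x -H ldap://ldap-labs.example.org \
    -b 'ou=hosts,dc=wikimedia,dc=org' \
    '(puppetClass=role::releases::upload)' dn
```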
[14:38:18] 6Labs, 10Tool-Labs: Overhaul logging setup for Tools (Tracking) - https://phabricator.wikimedia.org/T127367#2134014 (10chasemp) p:5Triage>3High
[15:10:33] Betacommand: are you about? what is /srv/project/tools/project/betacommand-dev/tspywiki/irc/logs
[15:18:24] Betacommand: there is an excess of 970G of irc logs there that are not accessible via web and are a dupe of things logged by wm-bot?
[15:18:57] on an 8T share for all of Tools that isn't sustainable
[15:34:16] 974G of that is SpamBotLog because the bot appears to be broken and throwing constant errors
[15:36:28] !log tools cleanup huge log collection for broken bot: /srv/project/tools/project/betacommand-dev/tspywiki/irc/logs# rm -fR SpamBotLog.log\.*
[15:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[15:47:34] !log tools had to kill stalkboten as it was logging constant errors filling logs to the tune of hundreds of gigs
[15:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[15:48:02] death to bad bots!
[15:49:03] 6Labs, 10DBA, 6Operations: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2134158 (10jcrespo) List of tables to reimport: {P2792}
[15:50:02] chasemp: what's the policy/practice for killing things that are running on tools-bastion-* that shouldn't be?
[15:50:16] It's a known violation so I just do it as needed
[15:50:21] * bd808 watches a nodejs process soak up a lot of cpu
[15:50:24] there is the -dev host which afaik is the place for that
[15:50:34] bd808: ok so I have a job that I kill and it respawns
[15:50:40] I'm not sure if it's our custom logic respawning or what
[15:50:45] but it's just logging errors like crazy
[15:50:59] root@tools-exec-1202:~# lsof | grep betacommand | grep -i spam
[15:51:00] twistd 20920 tools.betacommand-dev 4r REG 0,23 47957621272 108596996 /data/project/betacommand-dev/tspywiki/irc/logs/SpamBotLog.log
[15:51:08] bd808: any thoughts on how to nuke that?
[15:51:43] if it's a continuous job you need to stop the job itself
[15:51:51] or the grid will restart it
[15:51:57] I did stop it via the grid
[15:52:01] hmm
[15:52:05] but I think this is like a host of tools running under one "tool"
[15:52:06] bigbrother?
[15:52:35] Oh I'm sure. that's dead common for folks who came over from toolserver
[15:52:40] one mega tool
[15:52:58] so what is supervising it idk
[15:53:01] but it's a real problem
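On bd808's point at [15:51:43] that a continuous job has to be stopped through the grid or it will simply be restarted: the usual route is to act as the owning tool and delete the job itself rather than kill the process. A rough sketch using standard gridengine commands; the tool name comes from the lsof output above, and the job ID shown is the one Betacommand points at later in the log, used here only as an example:

```
become betacommand-dev   # switch to the tool account that owns the job (Tool Labs wrapper)
qstat                    # list the tool's grid jobs with their job IDs and states
qdel 4450940             # delete the job from gridengine so it is not respawned
```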
[15:53:21] and the dir it private?
[15:53:23] *is
[15:53:30] ls: cannot open directory .: Permission denied
[15:53:41] * bd808 gets out his sudo hammer
[15:54:22] hm
[15:54:32] so I'm changing perms on that log file I think to prevent writing to it
[15:54:47] my hope is this kills that particular process's ability to cause harm without affecting the rest
[15:54:54] but this is a pretty significant antipattern I think
[15:55:01] yeah
[15:55:12] we need a big red stop button for all the things
[15:55:51] * bd808 looks for a timemachine
[15:55:57] that proc has been throwing errors since the 2nd
[15:56:06] to the tune of about 974G
[15:56:13] ffs
[15:56:20] no choice but to stop it
[15:56:23] we also need quotas
[15:56:31] that's on my radar as well yeah
[15:56:57] that just reduced all of tools by 8%
[15:57:17] I've been idly wondering if we could make everything for a user happen in a container that could be throttled
[15:57:34] do you mean a "job" or the actual user
[15:57:37] like on bastion
[15:57:45] both actually
[15:57:54] sort of and to the same effect
[15:58:16] there is pam.d systemd that can use cgroups to basically sandbox all users to a resource pool
[15:58:20] that I am poking at when I can
[15:58:23] that's my plan there atm
[15:58:33] and containers are pretty much built around cgroups in many ways
[15:58:38] so that's very achievable
[15:58:39] like if you ssh'd in it would spawn a container that you were trapped in and then just about all you could do would be to spawn another container to run one script.
[15:59:00] yeah but instead of that exactly for users you drop them in a cgroup that limits cpu and ram and io etc
[15:59:05] same basic effect
[15:59:08] nod
[15:59:10] less messy..I think
[15:59:47] I guess what I'm really dreaming about is heroku
[16:00:01] where you don't really have shell access at all
[16:00:02] ah yeah, we have some problems they don't actually
[16:00:04] right
[16:00:19] I really think this is doable to contain the damage a user can do accidentally
[16:00:23] I just need some time to dig into it
[16:00:41] * bd808 hands chasemp some nodoz
[16:01:09] I went down the weirdest rabbit hole imaginable trying to throttle / shape NFS that included a detour into systemd and cgroup land
[16:01:15] and while it was 10 ways not to do that exactly
[16:01:21] I came out with some ideas on this that seem pretty solid
[16:01:38] i A libpam-systemd - system and service manager - PAM module
[16:01:41] cool. It seems like these should really be solved problems
[16:02:02] that package basically allows the integration of user login and systemd and cgroups etc
[16:02:04] like there are shell providers and shared hosting cos all over the place
[16:02:23] and universities too
[16:02:39] right, it's done a few ways that I have seen
[16:03:04] oddly a lot of university help pages are where I got a few good ideas
[16:03:25] i.e. help pages explaining usage patterns and limitations for shared systems
[16:03:32] I was thinking, oh yeah this is what we need
[16:03:35] not so odd really. that's where all the good docs on how the internetz work used to be
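The libpam-systemd idea chasemp is poking at around [15:58:16]-[16:02:02]: once pam_systemd registers each login with logind, everything a user runs lands in a per-user slice (user-&lt;uid&gt;.slice), and resource limits can be hung off that slice through the cgroup controllers. A rough sketch of what such limits could look like; the property names and values are illustrative and depend on the systemd version in use, they are not taken from the log:

```
# cap CPU and memory for everything running in one user's login sessions
systemctl set-property user-1000.slice CPUQuota=50% MemoryLimit=2G

# inspect what is actually grouped under that slice's cgroup
systemd-cgls /user.slice/user-1000.slice
```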
[16:03:49] * bd808 misses gopher some days
[16:05:54] creating a perm wall did the trick however distasteful
[16:06:22] we need to get logs off of nfs in general too
[16:06:28] as I know you know
[16:06:41] I can start helping on that next month
[16:06:53] which is really soon :)
[16:33:23] chasemp: sorry something must have really broken, it shouldn't have been that big
[16:33:50] chasemp: looking into it now
[17:24:14] chasemp: sorry about the log file, it looks like we updated the twisted python package which changed something about how it logs data and it was freaking out. I've gone ahead and tried to patch it; if it freaks out feel free to kill 4450940
[17:24:35] Betacommand: understood, thanks for looking into it
[17:25:28] chasemp: If something breaks in that way drop me an email, I'll take a look ASAP
[17:25:49] I normally only use ~2GB of disk space
[17:26:52] I'm at 1.2GB right now
[17:27:15] right it was all runaway errors or 99.99%
[17:28:41] chasemp: it didn't help that it was logging everything 140 times and appending a new timestamp each time
[17:30:09] * Betacommand goes back to not freaking out
[17:42:42] 6Labs, 6WMF-Legal: Ensure that Terms of Use document restrictions on third-party web interactions - https://phabricator.wikimedia.org/T129936#2134672 (10tom29739) I've been told that 3rd party server interactions were not allowed full-stop, so it would be good to get it clarified.
[17:49:28] 6Labs, 6WMF-Legal: Ensure that Terms of Use document restrictions on third-party web interactions - https://phabricator.wikimedia.org/T129936#2120419 (10Dzahn) Does this include cloning from github?
[18:13:49] !log deployment-prep activating automatic deployment of portals (https://gerrit.wikimedia.org/r/#/c/276397/)
[18:13:49] Please !log in #wikimedia-releng for beta cluster SAL
[18:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master
[18:38:51] 6Labs, 10Labs-Infrastructure, 6Revision-Scoring-As-A-Service, 10ores: ores-web-02 - cannot allocate memory - https://phabricator.wikimedia.org/T130338#2134884 (10Halfak)
[18:47:56] 6Labs, 10Labs-Infrastructure, 6Revision-Scoring-As-A-Service, 10ores: ores-web-02 - cannot allocate memory - https://phabricator.wikimedia.org/T130338#2134899 (10Halfak) Looks like an intermittent issue. ``` halfak@ores-web-02:~$ sudo puppet agent -tv Info: Retrieving pluginfacts Info: Retrieving plugin...
[18:49:26] 6Labs, 10Labs-Infrastructure, 6Revision-Scoring-As-A-Service, 10ores: ores-web-02 - cannot allocate memory - https://phabricator.wikimedia.org/T130338#2133150 (10Halfak) I'm going to resolve this in favor of T130394.
[18:49:32] 6Labs, 10Labs-Infrastructure, 6Revision-Scoring-As-A-Service, 10ores: ores-web-02 - cannot allocate memory - https://phabricator.wikimedia.org/T130338#2134919 (10Halfak) 5Open>3Resolved
[19:42:07] 6Labs, 6Operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2135094 (10chasemp)
[19:42:10] 6Labs, 10Labs-Infrastructure, 6Operations: Unable to connect both redundant labstores to the shelves in parallel - https://phabricator.wikimedia.org/T117453#2135093 (10chasemp) 5Open>3Invalid
[23:57:51] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
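As a footnote to the cleanup earlier in the day: spotting a runaway log like SpamBotLog on the shared NFS volume generally comes down to plain du over the tool's directory. A small sketch using the path quoted in the log; the sort/tail pipeline is just one convenient way to surface the largest files, not the exact commands that were run:

```
# rank the largest files under the tool's log directory (largest last)
du -ah /data/project/betacommand-dev/tspywiki/irc/logs | sort -h | tail -n 10

# total footprint of the tool's home on the shared volume
du -sh /data/project/betacommand-dev
```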