[00:37:07] RECOVERY - Puppet run on tools-bastion-10 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:24:32] PROBLEM - Puppet staleness on tools-cron-01 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [43200.0]
[01:30:28] PROBLEM - Puppet staleness on tools-grid-master is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[01:33:47] Labs-Kubernetes, grrrit-wm: grrrit-wm update/deployment failing - https://phabricator.wikimedia.org/T133189#2307640 (yuvipanda) Open>Resolved a:yuvipanda Since it works now.
[01:34:39] Labs, Labs-Kubernetes, Tool-Labs: Decide on upgrade policy for Kubernetes - https://phabricator.wikimedia.org/T133598#2307643 (yuvipanda) Since there are no objections, I'm going to set up a recurring window every two weeks for the k8s upgrade, preferably on Tuesdays.
[01:40:47] PROBLEM - Puppet staleness on tools-services-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0]
[06:13:34] Labs, DBA, Horizon: TGR unable to login on Horizon - https://phabricator.wikimedia.org/T131630#2307795 (jcrespo) a:jcrespo>None > "Chris suggests that something should be done to the 'users' table on silver" What should be done? I have 0 context here. Please do not assign it to me, apply th...
[11:05:20] any admin around?
[11:50:18] !log wikilabels manually rebooting wikilabels-staging-01.wikilabels.eqiad.wmflabs
[11:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL, Master
[12:24:43] andrewbogott: are you currently here?
[12:44:58] !log deploying e42885d to wikilabels
[12:44:58] deploying is not a valid project.
[12:51:04] !log wikilabels deploying e42885d to wikilabels
[12:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL, Master
[12:53:25] Labs, labs-sprint-116, DBA, Patch-For-Review: Make watchlist table available on labs - https://phabricator.wikimedia.org/T59617#2308672 (jcrespo)
[13:07:43] Labs: Investigate labnet1002 kernel panic - https://phabricator.wikimedia.org/T135322#2308774 (MoritzMuehlenhoff) Open>Resolved
[13:12:27] !log tools reboot tools-exec-1220 stuck in state of unresponsiveness
[13:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[13:13:27] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[13:14:02] Labs, Tool-Labs: Figure out a way to support java 1.8 on tool labs (Merl's bot) - https://phabricator.wikimedia.org/T121279#2308803 (BBlack) @Merl - Note: T105794#2294355 - deadlines for HTTPS are coming soon. MerlBot continues to be on the shortlist of bots making the highest numbers of insecure reques...
[13:15:55] chasemp: Currently not busy?
[13:16:08] currently busy, what's up?
[13:17:02] chasemp: I want to request one floating IP for my project. I'm currently trying to set up an alternative to the current IRC software for IRC-RC, as planned, and want to connect from outside too, to run tests
[13:18:06] I don't think that will be an issue but can you make a task so we have some history for the allocation? and @ me and andrew please
[13:18:36] ok
[13:19:31] Labs: Request of one floating IP for project 'rcm' - https://phabricator.wikimedia.org/T135730#2308832 (Luke081515)
[13:19:35] chasemp: ^
[13:28:00] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[13:29:33] Luke081515: okidoke I'll discuss it w/ andrew when he's around
[13:31:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[13:33:38] chasemp: ok, thanks
[13:45:32] Labs, Labs-Sprint-115, Tool-Labs, labs-sprint-116, and 2 others: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#2308972 (zhuyifei1999)
[13:54:45] chasemp: what is the difference between -l release=trusty and -l release=precise ?
[13:55:52] doctaxon: one specifies you want your job run on a precise host, the other on a trusty host
[13:56:33] RECOVERY - Puppet run on tools-webgrid-lighttpd-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:57:06] chasemp: yes, I know. But what are the differences between precise and trusty ?
[13:58:09] doctaxon: that question doesn't have an answer you want afaik, they are different releases of the same distro, so packages, kernel, etc
[13:58:21] you can ask if a package is a different version and I could look that up
[14:00:41] ah okay
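(For reference, the release selector being asked about above is just a resource request passed through to the grid; a minimal sketch, with a made-up script path:)

```
# Run the same job on a Trusty exec host vs. a Precise exec host; only the
# -l release=... part is what's being discussed, ~/bin/update.sh is a placeholder.
jsub -l release=trusty ~/bin/update.sh
jsub -l release=precise ~/bin/update.sh
```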
[14:03:26] Labs, Beta-Cluster-Infrastructure, Operations, Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2309076 (Joe)
[14:03:34] Labs, Beta-Cluster-Infrastructure, Operations, Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2307710 (Joe) p:Triage>Normal
[14:04:03] Labs, Beta-Cluster-Infrastructure, Operations, Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2307710 (Joe) p:Normal>Low
[14:04:40] Labs, Beta-Cluster-Infrastructure, Operations, Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2307710 (Joe) a:Joe
[14:10:05] Labs, Beta-Cluster-Infrastructure, Operations, Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2309108 (hashar) I have kept this one open in case #traffic want to investigate whether it can be a problem in production. For beta, the resta...
[14:36:14] PROBLEM - SSH on tools-webgrid-lighttpd-1408 is CRITICAL: Server answer
[14:45:26] PAWS, Jupyter-Hub: I can't login my bot in JUPYTER - https://phabricator.wikimedia.org/T135306#2309220 (Maathavan)
[14:53:29] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:54:34] PAWS, Jupyter-Hub: I can't login my bot in JUPYTER - https://phabricator.wikimedia.org/T135306#2309281 (yuvipanda) p:Triage>High Thank you for reporting this! Am looking into this right now.
[14:54:49] chasemp: worker nodes still have stale /public/dumps
[14:55:14] ah I grepped out exec and web that's why
[14:55:20] I have to hop on a call but
[14:56:23] "/bin/fuser -k /public/dumps; /bin/umount -f /public/dumps; /bin/mount -a; timeout 5s file /public/dumps/incr; echo $"
[14:56:26] YuviPanda: can you run that^
[14:57:49] yeah doing
[14:58:00] "/bin/fuser -k /public/dumps; /bin/umount -f /public/dumps; /bin/mount -a; timeout 5s test -e /public/dumps/incr; echo $"
[14:58:00] file => test -e
[14:58:07] file exits 0 even if there is nothing there :)
[14:58:15] not a functional change tho for remount itself
[14:59:05] chasemp: do you know when we get the new 16.04 ubuntu image? is that planned?
[14:59:21] Luke081515: nope, not planned afaik. just debian from now
[14:59:44] You sure you want "echo $"?
[14:59:51] YuviPanda: but for example vagrant won't run on debian
[14:59:57] if you take a look at the docs
[15:00:31] Luke081515: mediawiki vagrant? it will at some point, since mediawiki in production is also moving to debian
[15:00:36] ah it cut off 'echo $?' Leah :)
[15:00:41] hm
[15:00:44] :-)
[15:01:14] "mediawiki in production is also moving to debian" -- someday
[15:01:33] bd808: soon. _joe_ is working on it now I think, since he has new appservers and they're debian
[15:01:47] chasemp: hmm, an 'ls /public' still hangs
[15:01:49] oh cool. I had not heard anything about it
[15:02:14] I guess that means there is a big fixup project ahead for mw-vagrant
[15:02:33] I know there are a lot of upstart jobs in there
[15:08:35] chasemp: I think they're all fucked now.
[15:08:42] all?
[15:08:45] I'm on a call atm
[15:08:46] what is all
[15:08:52] chasemp: kk I'll investigate. all as in all worker nodes
[15:09:16] chasemp: no emergency etc. keep at call
[15:10:37] PAWS, Jupyter-Hub: I can't login my bot in JUPYTER - https://phabricator.wikimedia.org/T135306#2309373 (yuvipanda) @Maathavan does this work for you now?
[15:18:23] Labs, Labs-Infrastructure, Patch-For-Review: Prevent breaking puppet on the dnsrecursor host when novaadmin isn't in a project - https://phabricator.wikimedia.org/T133946#2309403 (Krenair) Open>declined We decided against this - puppet breaking is fine and better than silently failing.
[15:25:52] YuviPanda: back now, what's the deal?
[15:26:09] chasemp: tools-worker-1001
[15:26:26] anything that touches dumps gets stuck
[15:26:37] (am also eating with one hand atm)
[15:28:20] uh yeah wth
[15:29:26] YuviPanda: may have to reboot?
[15:29:33] I don't get the current state at all
[15:29:49] chasemp: yeah ok. i depooled anyway
[15:30:00] chasemp: should we switch it to soft before?
[15:30:08] or doesn't matter...
[15:30:21] I'm coming in w/ a bit of refactor and that next across all for scratch and dumps I hope
[15:30:32] but I'm doing a bit of testing on what the outcome is for changing nfs options on the fly
[15:30:37] i.e. remount etc
[15:30:47] there is a race condition with it, but I'm unsure if we care
[15:30:52] long story :)
[15:31:29] also I have a different option set, you have timeo=10
[15:31:39] did you know that's 1/10th of a second?
[15:31:45] hahah really?
[15:31:46] so that means it times out now at .1 second
[15:31:47] no
[15:31:48] yes
[15:31:55] funny shit
[15:32:05] chasemp: can you reboot btw?
[15:32:08] so I'm thinking timeout=100,retrans=3
[15:32:09] am doing so
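(A sketch of the option change being floated here, assuming the usual nfs(5) semantics: the option is spelled timeo, it is expressed in tenths of a second, and retrans is the number of retries before the client reports the server as not responding. Applying it via remount is exactly the on-the-fly change chasemp says he is still testing, so treat this as illustrative rather than a verified procedure; the mount point is the one from the log.)

```
# Remount /public/dumps with a longer NFS timeout: timeo is in tenths of a
# second, so timeo=100 is ~10s per attempt, and retrans=3 allows three retries.
mount -o remount,timeo=100,retrans=3 /public/dumps
# Quick check that the share answers afterwards, mirroring the one-liner quoted
# earlier in the log:
timeout 5s test -e /public/dumps/incr; echo $?
```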
[15:32:45] I even got it to unmount
[15:32:50] but remounting for me or puppet was hanging
[15:32:53] so it's some kind of bug
[15:32:56] as that's not usual
[15:33:39] fun
[15:34:42] it's back and fine now
[15:35:04] but jessie has some unique issue here
[15:35:06] yeah and it's pooled itself too
[15:35:12] (as it is designed to)
[15:36:41] so I've been asking a lot about storage for candidates YuviPanda and the only one that really understood our overload issue w/ every client being able to do unlimited io and killing the cluster has been T
[15:36:48] and he basically said, yes it was a thing but they never solved it
[15:36:57] even ceph folks are kind of bewildered
[15:37:10] heh
[15:37:17] handing out direct storage to customers in this way seems very unusual
[15:37:24] chasemp: btw, tools-worker-1012 has a different type of nfs issue now
[15:37:30] chasemp: /public/dumps won't mount
[15:37:36] even tho it isn't mounted
[15:37:46] does it hang?
[15:37:50] mount -a never returns
[15:37:51] chasemp: ya
[15:37:55] ya
[15:37:56] same issue then
[15:38:05] idk reboot is all I know for sure to fix it
[15:38:16] atm
[15:38:19] I guess this reinforces my idea that touching anything NFS is never ever simple :)
[15:38:27] ya rebooting
[15:39:14] so this relates to mystery /mnt/nfs :)
[15:39:28] I'm looking at putting all nfs mounts under /mnt/nfs as I don't like them at top level like /home
[15:39:35] chasemp: looks like I have to reboot all the k8s workers now :(
[15:39:40] YuviPanda: hm ok
[15:39:47] I'll try to keep in mind to test on jessie too for things
[15:40:00] chasemp: or not, since a reboot itself hangs...?!
[15:40:05] at least a sudo reboot hangs
[15:40:11] huh
[15:41:02] I'm just going to reboot from horizon now
[15:41:06] ok man
[15:41:19] aaaa except I can't since I just wiped my phone again this morning.
[15:41:24] I wish I had a good answer here
[15:41:38] (PS1) BryanDavis: jsub: validate '-l release=...' arguments [labs/toollabs] - https://gerrit.wikimedia.org/r/289677
[15:41:42] I'm going to reboot from nova
[15:41:47] heh ok
[15:42:17] (Abandoned) BryanDavis: jsub: Add support for qsub args used by tools.merlbot [labs/toollabs] - https://gerrit.wikimedia.org/r/288226 (https://phabricator.wikimedia.org/T135006) (owner: BryanDavis)
[15:43:18] !log tools rebooting all tools worker instances
[15:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[15:43:25] (CR) Rush: [C: 1] "not tested but looks sane to me, thanks bryan" [labs/toollabs] - https://gerrit.wikimedia.org/r/289677 (owner: BryanDavis)
[15:43:31] (CR) jenkins-bot: [V: -1] jsub: validate '-l release=...' arguments [labs/toollabs] - https://gerrit.wikimedia.org/r/289677 (owner: BryanDavis)
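(With Horizon unavailable, the reboot presumably went through the nova CLI on a control host; a rough sketch, where the credentials file name is an assumption and the instance name a placeholder:)

```
# Load OpenStack admin credentials (assumed file name), then hard-reboot the
# stuck instance; a soft reboot can hang when the guest is wedged on NFS I/O.
source ~/novaenv.sh
nova reboot --hard tools-worker-1012
```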
[15:44:15] PROBLEM - SSH on tools-worker-1004 is CRITICAL: Connection refused
[15:46:18] PROBLEM - SSH on tools-worker-1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:49:16] RECOVERY - SSH on tools-worker-1004 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[15:51:08] RECOVERY - SSH on tools-worker-1009 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[16:02:03] YuviPanda: that reminds me of https://phabricator.wikimedia.org/T130577, in that case sudo reboot itself also hangs, but it's not an nfs instance
[16:04:18] (and sudo kill -9 on hanging processes does nothing)
[16:24:06] Labs: Request of one floating IP for project 'rcm' - https://phabricator.wikimedia.org/T135730#2309644 (Andrew) Open>Resolved a:Andrew ok, done.
[16:43:52] PAWS, Jupyter-Hub: I can't login my bot in JUPYTER - https://phabricator.wikimedia.org/T135306#2309716 (yuvipanda) I have had to restart all the notebooks due to some NFS issues. Re-logging in should work now. Apologies for the issues!
[17:06:39] !log tools.hatjitsu Stopped competing web jobs; started with `webservice nodejs start`
[17:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.hatjitsu/SAL, Master
[18:23:06] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[19:28:46] (CR) Yuvipanda: [C: 1] "Needs a changelog bump for deployment tho." [labs/toollabs] - https://gerrit.wikimedia.org/r/289677 (owner: BryanDavis)
[19:34:14] (PS3) BryanDavis: jsub: validate '-l release=...' arguments [labs/toollabs] - https://gerrit.wikimedia.org/r/289677
[19:38:25] (CR) Yuvipanda: [C: 2] jsub: validate '-l release=...' arguments [labs/toollabs] - https://gerrit.wikimedia.org/r/289677 (owner: BryanDavis)
[19:38:36] bd808: \o/ will you do the deploy at some point?
[19:38:56] (Merged) jenkins-bot: jsub: validate '-l release=...' arguments [labs/toollabs] - https://gerrit.wikimedia.org/r/289677 (owner: BryanDavis)
[19:40:27] YuviPanda: Yeah. Let me see if I can follow the instructions I wrote but didn't actually test :)
[19:44:30] where do I get a list of all the bastions? Just from looking at the instance list?
[19:45:00] Bd808 yeah
[19:45:12] Easiest I think
[19:52:08] YuviPanda: aptly error -- ERROR: unable to publish: unable to process packages: error linking file to /srv/packages/public/pool/main/t/tools-manifest/tools-manifest_0.6_all.deb: file already exists and is different
[19:57:17] bd808 ugh. Just RM that file for now
[19:57:21] That makes it work
[19:57:29] I need to fix that at some point
[19:57:37] I have an idea of what causes it
[19:58:06] A mistake I made early on where I tried to re-add the same deb version for two different builds
[19:58:06] aptly doesn't seem to like getting 2 builds of a universal package
[19:58:21] It succeeded and then hasn't stopped complaining since
[19:58:57] Really? If I RM it, it just works usually
[19:59:56] you rm the target file?
[20:00:57] The file it complains about
[20:01:25] The old tools manifest deb
[20:01:34] ok
[20:02:40] I'm pretty sure the problem is that `find /srv/packages/pool -name tools-manifest_0.6_all.deb` returns 2 _all packages
[20:03:07] I just caused the same thing for this new build :)
[20:03:12] Haha
[20:03:15] Fun
[20:03:39] Debs are from a different era imo
[20:03:57] For things like this that is
[20:05:04] blarg
[20:05:25] now I've got fresh problems. /me debugs
[20:05:40] I rm'd a dup and made the index mad
[20:10:01] All better I think
[20:14:47] \o/
[20:14:54] bd808: can you add it to the Aptly page on wikitech?
[20:29:22] YuviPanda: https://wikitech.wikimedia.org/w/index.php?title=Aptly&diff=544645&oldid=501052
[20:30:02] bd808: <3 thanks
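(A small sketch built on the find command quoted above: list any _all.deb filename that shows up more than once under the aptly pool, which is the duplicate state behind the "file already exists and is different" publish error. The path is the one from the log; treating every universal package as matching *_all.deb is an assumption.)

```
# Print _all.deb filenames that exist in more than one place under the pool;
# each hit is a candidate for the "file already exists and is different" error.
find /srv/packages/pool -name '*_all.deb' -printf '%f\n' | sort | uniq -d
```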
[20:31:49] aargh. load on bastion-02 is 66
[20:32:43] why are there 61 copies of sessionclean running?
[20:33:15] I know this bug from somewhere ...
[20:34:16] Ah T73645
[20:34:17] T73645: php5 session cleanup script can go nuts - https://phabricator.wikimedia.org/T73645
[20:34:33] Which I fixed in mw-vagrant with https://gerrit.wikimedia.org/r/#/c/164877/
[21:07:21] !log tools deployed jobutils 1.13 on bastions; now with '-l release=...' validation!
[21:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:35:02] RECOVERY - Puppet run on tools-worker-1010 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:13:58] Labs, Tool-Labs, Community-Tech-Tool-Labs, Security-Reviews: Security review of Tool Labs console application - https://phabricator.wikimedia.org/T135784#2311122 (bd808)
[23:15:05] Labs, Tool-Labs, Community-Tech-Tool-Labs, Security-Reviews: Security review of Tool Labs console application - https://phabricator.wikimedia.org/T135784#2311140 (bd808)
[23:15:20] Labs, Tool-Labs, Community-Tech-Tool-Labs, Security-Reviews: Security review of Tool Labs console application - https://phabricator.wikimedia.org/T135784#2311122 (bd808)
[23:15:24] Labs, Tool-Labs, Community-Tech-Tool-Labs, Diffusion, User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2226182 (bd808)
[23:17:27] YuviPanda, so I just had a very strange experience when I logged into my cyberbot exec node on my cyberbot project using WinSCP
[23:18:07] It SSH'd into Bastion just fine, and then proceeded to continue to the exec node only to tell me the fingerprint on it didn't match the currently saved one.
[23:18:22] I aborted, naturally
[23:18:30] When I tried again, it worked just fine.
[23:19:05] Since it went into Bastion successfully, I'm assuming something there went wrong momentarily.
[23:19:23] I'm unable to replicate it, but I thought you should know
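(Not from the log itself, but a hedged way to follow up on a host key warning like the one described above: fetch the fingerprint the instance is presenting right now and compare it with what the client has cached. The hostname is a placeholder for the cyberbot exec node, and the command assumes it is run from somewhere that can reach the instance, e.g. the bastion.)

```
# Show the fingerprint(s) of the host key the exec node currently presents,
# for comparison against the one WinSCP has saved; hostname is a placeholder.
ssh-keyscan cyberbot-exec-01.eqiad.wmflabs 2>/dev/null | ssh-keygen -lf /dev/stdin
```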