[00:00:34] annika: great, i've removed the erroneous symlink
[00:02:24] madhuvishy: chasemp: thank you for your quick help!
[00:04:03] doctaxon: everything's back now and working
[00:06:28] Having trouble with Wikipedia Library Card platform suddenly. It wasn't on the planned list, but we're getting server errors:
[00:06:29] 1) Failure: https://twl-test.wmflabs.org/oauth/login/?next=/users/test_permission/; 2) Server: https://wikitech.wikimedia.org/wiki/Nova_Resource:Twlight-test.twl.eqiad.wmflabs; 3) Github: https://github.com/thatandromeda/TWLight
[00:06:29] Any reason this would have affected us if we're not on the list anywhere?
[00:06:39] Also hi :)
[00:07:53] Or is the solution just to do a manual restart anyway?
[00:08:44] that error message is not super informative
[00:09:26] but a manual restart is a good first action
[00:09:34] Ocaasi: hello
[00:09:51] possibly whatever service you are running is down and the web server cannot connect to it
[00:09:54] yes there were a bunch of unexpected outcomes that affected other instances
[00:10:11] i second tgr and recommend trying to restart the service too
[00:13:46] ok, thank you. will try that first.
[00:14:58] Ocaasi: i checked on the instance - and it seems like it's missing /home directories - not sure how it got missed on our recovery lists yesterday
[00:15:04] i'm recovering now
[00:15:52] possibly the cause of the service disruption
[00:16:07] thank you!
[00:16:55] Ocaasi: no problem, and sorry for the trouble. will ping you once it gets done
[00:18:01] thanks again. stuff happens!
[00:37:00] 06Labs, 06Operations, 13Patch-For-Review, 07Tracking: Migrate misc to secondary labstore HA cluster - https://phabricator.wikimedia.org/T154336#2954890 (10Ocaasi_WMF) Apparently we were missed on the list and therefore not rebooted, so it's being recovered now. Should fix it most likely. Thanks! -Jake
[00:46:02] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:50:52] !log tools sudo qdel -f 1199218 to force delete a stuck toolschecker job
[00:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:50:57] madhuvishy: ^ that may help
[00:51:14] bd808: i did that too :)
[01:14:42] yuvipanda: ping
[01:14:57] The restore for twlight-test is still going on. I'll be afk for a couple of hours, but respond on phone if anything's up with home directories etc
[01:15:23] madhuvishy: did this issue in any way affect the cyberbot-exec-01 instance?
[01:15:51] Because the file permissions are completely screwed up now.
[01:16:17] CP678|Direct: yeah. apparently the /home restores didn't fix all perms
[01:16:23] My bots are crashing because of a permissions error. It can't access the Peachy framework.
[01:16:33] can you put them back yourself or do you need some help?
[01:16:36] CP678|Direct: looking
[01:16:37] * CP678|Direct is trying to fix them.
[01:17:35] CP678|Direct: yup it was one of the affected instances
[01:17:45] do you have access to your home?
[01:18:19] madhuvishy: I'm working on it, give me a sec. Was cyberbot-exec-iabot-01 also affected?
[01:18:29] Because that's InternetArchiveBot's home.
[01:18:47] CP678|Direct: ah yes
[01:19:18] there might have been some issues there with permissions not being able to be granted to user IABot
[01:19:29] I can fix some of it from my end
[01:19:43] Ugh, permission denied.
[01:23:45] andrewbogott: why do I see a home folder for you in my instance?
[01:24:34] CP678|Direct: you should have access now
[01:24:47] he might have logged in for testing
[01:24:52] CP678|Direct: probably because he ssh'd in to fix puppet or do a kernel patch at some point
[01:25:01] madhuvishy: thanks
[01:25:08] bd808: I see.
[01:25:12] expect that any of the Labs team will access any instance at any time
[01:25:16] I wasn't able to fix the IABot folder
[01:25:19] not sure why
[01:25:34] madhuvishy: I believe IABot is owned by root.
[01:25:40] yes right now
[01:25:48] So I'll have to fix that myself.
[01:26:01] okay
[01:26:13] executable permissions have been lost
[01:26:42] you will probably have to fix that with chmod +x
[01:27:15] madhuvishy: I'm resetting all permissions.
[01:27:16] I've gotta run, let me know if it goes okay! sorry for the trouble
[01:27:22] Just to ensure consistency
[01:27:35] madhuvishy: thanks for your help
[01:37:00] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:40:15] madhuvishy: the restore deleted a bunch of files.
[01:40:40] permissions are no longer an issue but a whole bunch of Peachy Includes are missing.
[01:44:02] I have restored the missing files from my local copy, but really, WTH?
[01:46:17] Oh lovely, half of my bot scripts are missing too.
[01:47:01] I'm surprised the linux OS on that instance is still intact
[01:49:07] Well that was fun
[01:52:50] madhuvishy: when you have the chance can you investigate what happened to my exec node? Literally 60% of my bot files were gone and had to be restored.
[02:42:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:08:00] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[03:37:42] PROBLEM - Puppet run on tools-exec-1411 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[04:17:43] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:38:44] PROBLEM - Puppet run on tools-exec-1411 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[05:15:01] 10Quarry: Explain command forces Quarry to keep running endlessly - https://phabricator.wikimedia.org/T155808#2955224 (10Soni)
[05:18:42] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:13:03] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:39:02] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[06:39:42] PROBLEM - Puppet run on tools-exec-1411 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[07:15:32] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2955290 (10yuvipanda) @akosiaris hmm I'd really like to keep the pin in puppet - there's enough uncertainty as is without having to find docker version mis...
[07:30:07] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2955295 (10yuvipanda) If we have only one version it also means we are tying the prod and tools versions together forever, with upgrades needing to happen a...
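An aside on the `chmod +x` fix discussed at [01:26] above: re-adding execute bits wholesale (`chmod -R +x`) would also mark plain data files executable. A minimal sketch of a more targeted pass, assuming a hypothetical `~/bots` tree (the real Cyberbot layout isn't shown in the log) and using the shebang line as the heuristic for what should be executable:

```python
import os
import stat

ROOT = os.path.expanduser("~/bots")  # hypothetical path; adjust to the real tree

for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            with open(path, "rb") as f:
                is_script = f.read(2) == b"#!"  # shebang => meant to be run directly
        except OSError:
            continue  # unreadable entry; skip rather than guess
        if is_script:
            mode = os.stat(path).st_mode
            # Re-add the execute bits without touching any other mode bits.
            os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Files pulled in as includes (like the Peachy framework mentioned above) don't need the execute bit at all; only entry-point scripts do, which is why a blanket reset "to ensure consistency" is the blunter option.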
[07:44:44] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:32:35] 10PAWS: PAWS: Error loading notebook (Disk I/O error) - https://phabricator.wikimedia.org/T155812#2955358 (10Kenrick95)
[08:42:53] !log wikilabels deploying 15f7a42 to staging
[08:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL
[08:44:02] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:09:59] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[09:16:59] !log wikilabels deploying 15f7a42 to prod
[09:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL
[09:38:29] 06Labs, 10Tool-Labs: Mail from cron regarding a failure of jsub - https://phabricator.wikimedia.org/T155787#2955533 (10MarcoAurelio)
[09:42:46] 06Labs, 10Tool-Labs, 10DBA: Provisioning MySQL replica users fails on tool labs - https://phabricator.wikimedia.org/T151014#2955540 (10Marostegui) @yuvipanda ok to close this ticket?
[09:51:25] 10Tool-Labs-tools-LTA-Knowledgebase: Create password change function - https://phabricator.wikimedia.org/T155675#2955545 (10Legoktm) >>! In T155675#2951700, @Samtar wrote: > @Legoktm not really, I was modelling the account request/login structure off of UTRS - to be honest I don't understand how OAuth would be u...
[10:10:40] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[10:13:47] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955587 (10hashar)
[10:15:03] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:15:43] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955587 (10yuvipanda) Is this running the latest puppet code?
[10:20:20] 06Labs, 10Tool-Labs, 10Tools-Kubernetes: Reassign service/pod IP ranges for kubernetes on tool labs - https://phabricator.wikimedia.org/T152399#2955613 (10yuvipanda) What we need to do: 1. Verify that `192.168.0.0/16` is a good range to use for pod IPs. We currently allocate a /24 out of this to each kubern...
[10:29:06] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955638 (10hashar) The puppetmaster was stalled with 2-3 days of lag and I rebased yesterday just...
[10:31:53] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955641 (10yuvipanda) Yes, there's probably going to be a refactor at some point for that. Doe...
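The sizing question in T152399 ([10:20] above) is plain subnet arithmetic, easy to sanity-check with Python's `ipaddress` module; the /24-per-node allocation is taken from the task, and the capacity numbers simply follow from it:

```python
import ipaddress

pod_space = ipaddress.ip_network("192.168.0.0/16")

# Each kubernetes node is handed a /24 carved out of the pod range.
node_subnets = list(pod_space.subnets(new_prefix=24))

print(len(node_subnets))              # 256 -> the /16 accommodates 256 nodes
print(node_subnets[0].num_addresses)  # 256 addresses available per node's /24
```

So the /16 caps the cluster at 256 nodes; whether that headroom, and the risk of colliding with other private uses of 192.168.0.0/16, is acceptable is exactly what step 1 of the task asks to verify.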
[10:41:00] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[10:51:32] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955655 (10hashar) On `integration-slave-jessie-1001` ``` # /usr/local/sbin/nfs-mount-manager...
[10:59:25] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955675 (10hashar) That matches the hosts failing puppet: ``` # salt --ou...
[11:02:03] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955685 (10hashar) https://gerrit.wikimedia.org/r/#/c/333230/ cherry pick...
[11:03:53] 06Labs, 10Tool-Labs: Mail from cron regarding a failure of jsub - https://phabricator.wikimedia.org/T155787#2955686 (10zhuyifei1999) a:03zhuyifei1999 The error was probably from the NFS outage, but the error reporting is obviously broken: ``` >>> try: ... f = open('/dev/full', 'w'); f.write('a'); f.close...
[11:15:33] (03PS1) 10Zhuyifei1999: jsub: Change IOError string substitution conversion from '%e' to '%s' [labs/toollabs] - 10https://gerrit.wikimedia.org/r/333232 (https://phabricator.wikimedia.org/T155787)
[11:15:39] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:39:18] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure, 07Beta-Cluster-reproducible, 07Puppet: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#2955738 (10hashar) That is still happening. Happened today when creat...
[11:49:51] 10PAWS: PAWS: Error loading notebook (Disk I/O error) - https://phabricator.wikimedia.org/T155812#2955358 (10Mattho69) Same issue for me
[12:01:03] !log video restarting all currently running v2c workers (celery{1,2}@encoding0{1,2}) and web frontend due to NFS outage, and we aren't very sure which component is erroring (T155803)
[12:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[12:01:06] T155803: ERROR IOError: {Error no 116] Stale file handle - https://phabricator.wikimedia.org/T155803
[12:06:44] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[13:21:53] 10PAWS: PAWS: Error loading notebook (Disk I/O error) - https://phabricator.wikimedia.org/T155812#2955942 (10yuvipanda) can you go to control panel, stop your server and start it again to see if it still persists? We had some NFS issues earlier that should be fixed now...
[13:22:13] 06Labs, 10Tool-Labs, 10DBA: Provisioning MySQL replica users fails on tool labs - https://phabricator.wikimedia.org/T151014#2955943 (10yuvipanda) Yup!
[13:23:45] 06Labs, 10Tool-Labs, 10DBA: Provisioning MySQL replica users fails on tool labs - https://phabricator.wikimedia.org/T151014#2955946 (10Marostegui) 05Open>03Resolved
[13:30:52] 10PAWS: PAWS: Error loading notebook (Disk I/O error) - https://phabricator.wikimedia.org/T155812#2955953 (10Kenrick95) 05Open>03Resolved It works now after stopping and starting the server. Thanks :)
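On the jsub patch at [11:15] above ('%e' to '%s'): `%e` is a floating-point conversion, so interpolating an exception object with it raises a `TypeError` inside the error handler and masks the real failure, which is what the truncated `/dev/full` snippet quoted from T155787 demonstrates. A self-contained sketch of the same failure mode (not the actual jsub code):

```python
# /dev/full is a Linux device that rejects every write with ENOSPC.
try:
    f = open('/dev/full', 'w')
    f.write('a')   # a small write lands in the buffer...
    f.close()      # ...and actually hits the device here, raising IOError
except IOError as err:
    try:
        print('jsub failed: %e' % err)   # old code: TypeError, real error lost
    except TypeError as bug:
        print('error reporting itself broke: %s' % bug)
    print('jsub failed: %s' % err)       # patched code: prints the ENOSPC error
```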
[14:50:21] 06Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2757207 (10Joe) Today I wanted to go around horizon to check and refactor hiera keys before merging https://gerrit.wikimedia.org/r/#/c/332355/. It was a very frustrating experience, and I think it is a good thing to just rep...
[14:50:28] 06Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2956043 (10Joe) p:05Triage>03Unbreak!
[14:51:08] 06Labs, 06Operations, 07Puppet: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2956045 (10Joe)
[15:03:36] zhuyifei1999_: Thanks for T155787
[15:03:37] T155787: Mail from cron regarding a failure of jsub - https://phabricator.wikimedia.org/T155787
[15:03:54] np
[15:04:36] zhuyifei1999_: First Phabricator task and (almost) fixed within hours. I am happy :-D
[15:05:40] it's not fixed though
[15:06:01] it was a temporary NFS outage
[15:06:14] the error reporting broke
[15:06:15] almost - the review and the roll-out are still missing
[15:06:43] and the patch only addresses the error reporting
[15:07:27] No problem with the error itself. One task starts every 10 minutes, so it does not matter, and I ran the other one by hand ~1 hour later and it worked.
[15:36:31] !log tools disable puppet everywhere to cherrypick patch moving base to a profile
[15:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:37:17] joe: done
[15:37:31] <_joe_> cool
[15:37:39] <_joe_> cherrypicking now
[15:39:06] <_joe_> running puppet on the puppetmaster first
[15:39:18] <_joe_> let's see if I did everything right :)
[15:39:19] <_joe_> uhm
[15:39:31] <_joe_> +# Don't allow people to forward their agents either.
[15:39:32] <_joe_> +AllowAgentForwarding no
[15:39:37] <_joe_> where is this configured
[15:39:50] joe: labs.yaml
[15:39:53] joe: in ops/puppet
[15:40:04] <_joe_> ahhh and of course since it's a single setting
[15:40:08] <_joe_> it's overridden
[15:40:11] <_joe_> ok, right
[15:40:13] <_joe_> meh
[15:40:15] <_joe_> fixing it
[15:40:41] PROBLEM - Puppet run on tools-exec-1411 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:41:27] not sure where ^ is from, I see that it's disabled cross fleet
[15:41:46] <_joe_> yuvipanda: I'll look in a few
[15:42:07] _joe_: no, it's strange because puppet shouldn't be running there at all
[15:42:23] <_joe_> yuvipanda: it might have been running while you disabled it
[15:42:32] joe: ah, good call
[15:42:45] <_joe_> yuvipanda: what's the cache time of the mwyaml backend?
[15:42:48] <_joe_> 1 minute?
[15:42:59] joe: checking...
[15:43:08] damn, my emacs locked up again
[15:44:56] joe: yes 60
[15:44:56] s
[15:45:09] <_joe_> yuvipanda: ok I think it's properly a noop now
[15:45:24] joe: ok! wanna test on another host too?
[15:45:34] yuvipanda: when you have time, could you look at https://gerrit.wikimedia.org/r/#/c/333232/ ?
[15:45:40] <_joe_> I'm doing it
[15:45:58] zhuyifei1999_: will do!
[15:46:01] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2879938 (10MoritzMuehlenhoff) But reprepro somewhat supports multiple versions as long as they're stored in different sections (or whatever the exact termin...
[15:47:02] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2956102 (10yuvipanda) We'll maybe have to create a 'labs' section in reprepro and use it?
[15:47:31] <_joe_> yuvipanda: running now on tools-exec-1411 specifically
[15:47:55] joe: ok!
[15:48:20] <_joe_> noop!
[15:48:28] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2956105 (10MoritzMuehlenhoff) That or maybe "staging" to make it a little more generic.
[15:49:58] joe: ok, shall I start enabling puppet group by group?
[15:50:08] wanna test it on, say, bastion first (it has more ssh stuff)
[15:54:34] <_joe_> yuvipanda: yep, definitely
[15:54:39] <_joe_> more ssh stuff? where?
[15:54:44] <_joe_> I couldn't find it
[15:54:46] _joe_: noop! I'm going to test the grid master now, and then just enable it
[15:54:54] <_joe_> test the bastions
[15:55:02] <_joe_> if they have specific ssh things
[15:55:05] joe: yep, did. was a noop
[15:55:20] _joe_: nope, they didn't! I must've remembered wrong
[15:55:32] <_joe_> ok cool
[15:55:41] <_joe_> ssh is the only thing that could be problematic
[15:58:18] joe: ok seems good. going to deploy across now
[15:58:40] !log tools enabling puppet across all hosts
[15:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:00:40] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:20:38] 06Labs: labstore1004 is high load and periodic unavailability to icinga - https://phabricator.wikimedia.org/T155832#2956182 (10Paladox)
[17:00:04] doing a revision import via the API gives me the response: import {{ns 0 title {D-Day Dodgers} revisions 123}} <- but no revisions were actually imported to dewiki. I notice this phenomenon more and more frequently. The next import job then works fine again. Is there anything wrong with the import API?
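One way to catch the symptom doctaxon describes above (the import module answering `revisions 123` while nothing shows up on dewiki) is to verify against the wiki instead of trusting the import response. A sketch using the standard `action=query` API; this only detects the mismatch, it doesn't explain the import module's behavior:

```python
import requests

API = "https://de.wikipedia.org/w/api.php"
TITLE = "D-Day Dodgers"  # the page from the report above

# Ask dewiki itself for the page's newest revision.
r = requests.get(API, params={
    "action": "query",
    "titles": TITLE,
    "prop": "revisions",
    "rvprop": "ids|timestamp",
    "rvlimit": 1,
    "format": "json",
    "formatversion": 2,
}, headers={"User-Agent": "import-check-sketch/0.1"}).json()

page = r["query"]["pages"][0]
if page.get("missing"):
    print("import reported revisions, but the page does not exist")
else:
    print("newest revision on-wiki:", page["revisions"][0])
```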
[17:46:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:14:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:14:17] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:16:55] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[18:18:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:18:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:22:08] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[18:22:54] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[18:23:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:23:52] 10Tool-Labs-tools-LTA-Knowledgebase: Migrate to OAuth - https://phabricator.wikimedia.org/T155841#2956430 (10Samtar)
[18:24:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:24:18] hmmm
[18:24:19] not sure what's going on
[18:25:22] PROBLEM - Puppet run on tools-exec-1219 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[18:26:04] PROBLEM - Puppet run on tools-exec-1218 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:27:07] yuvipanda: can you give the wikimedia api a nudge before you punch out? It's been 80 minutes now with no update :(
[18:28:33] PROBLEM - Puppet run on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:29:20] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[18:29:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:33:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:37:01] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:37:47] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure, 07Beta-Cluster-reproducible, 07Puppet: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#2736876 (10scfc) (T152941 is slightly related, but refers to the case...
[18:39:09] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:43:38] yuvipanda: you're very familiar with Meteor, right?
[18:53:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:54:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:56:52] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[18:57:08] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:57:52] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:03:09] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:03:31] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:04:37] RECOVERY - Puppet run on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:05:21] RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:06:03] RECOVERY - Puppet run on tools-exec-1218 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:19:09] PROBLEM - Free space - all mounts on tools-exec-1221 is CRITICAL: CRITICAL: tools.tools-exec-1221.diskspace._public_dumps.byte_percentfree (No valid datapoints found) tools.tools-exec-1221.diskspace.root.byte_percentfree (<11.11%)
[19:36:23] (03PS1) 10Andrew Bogott: Update novaobserver passwd [labs/private] - 10https://gerrit.wikimedia.org/r/333290
[19:38:19] (03CR) 10Alex Monk: [V: 032 C: 032] Update novaobserver passwd [labs/private] - 10https://gerrit.wikimedia.org/r/333290 (owner: 10Andrew Bogott)
[19:43:51] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add fake clushuser keypair [labs/private] - 10https://gerrit.wikimedia.org/r/325050 (owner: 10Merlijn van Deen)
[19:44:08] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add tools hiera common.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/325041 (owner: 10Merlijn van Deen)
[20:13:54] Is the unique user id in the oauth return value the "exp", "sub", "iat" or "aud"? Where can I look up these values?
[20:33:26] tobias47n9e: https://www.mediawiki.org/wiki/Extension:OAuth#Identify_the_User_.28optional.29
[20:33:44] looks like "sub" is the user_id
[20:35:00] bd808: Thanks, didn't read that page before.
[20:35:41] there is some good stuff there and in the sub pages. Not all of our docs are horrible :)
[20:36:13] bd808: By the way, can someone from the mediawiki technical team review my Django oauth backend? It would be nice to know if it handles security and privacy correctly
[20:37:22] I'd be glad to take a look. where is your source?
[20:39:24] bd808: I can push the branch in a few hours and ping you. Doesn't have to be reviewed today, but next week would be great.
[20:42:02] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:56:32] tobias47n9e: a phab task assigned to me would help. I'm juggling a lot of stuff right now but I can make some time to help out :)
[20:59:19] 06Labs: labstore1004 is high load and periodic unavailability to icinga - https://phabricator.wikimedia.org/T155832#2956817 (10chasemp)
[21:00:48] 06Labs: labstore1004 is high load and periodic unavailability to icinga - https://phabricator.wikimedia.org/T155832#2956151 (10chasemp) Working from the rp_filter confusion theory and the ongoing icinga issues (somewhere between 2-4 host unavailable warnings per hour over the previous 8 hours at least) I shutdow...
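To pin down the claim names tobias47n9e asks about at [20:13] above: the /identify response documented on the page bd808 links is a JWT whose `sub` claim carries the central user id, while `exp`, `iat` and `aud` are the standard expiry, issued-at and audience fields. A sketch of decoding it, assuming PyJWT and the HS256 signature keyed on the consumer secret that the documentation describes:

```python
import jwt  # PyJWT

CONSUMER_KEY = "..."     # from the registered OAuth consumer
CONSUMER_SECRET = "..."  # the /identify JWT is signed with this

def identify_user(identify_jwt: str) -> dict:
    """Decode and verify the JWT returned by MediaWiki's OAuth /identify."""
    return jwt.decode(
        identify_jwt,
        CONSUMER_SECRET,
        algorithms=["HS256"],
        audience=CONSUMER_KEY,  # rejects tokens minted for another consumer
    )  # the signature and the 'exp' claim are checked by PyJWT itself

# claims = identify_user(raw_jwt)
# claims["sub"] -> stable central user id; claims["iat"] -> issue time.
# The documented payload also carries fields like "username".
```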
[21:08:02] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[21:25:22] 10Tool-Labs-tools-LTA-Knowledgebase: Add password confirmation box - https://phabricator.wikimedia.org/T155854#2956884 (10Mike1901)
[21:53:03] 10Tool-Labs-tools-LTA-Knowledgebase: Add password confirmation box - https://phabricator.wikimedia.org/T155854#2957007 (10Samtar) 05Open>03Resolved
[22:18:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:30:14] 10Tool-Labs-tools-LTA-Knowledgebase: Require confirmation diff on account request - https://phabricator.wikimedia.org/T155704#2957111 (10Samtar) 05Open>03Resolved
[22:40:57] PROBLEM - Free space - all mounts on tools-exec-1220 is CRITICAL: CRITICAL: tools.tools-exec-1220.diskspace._public_dumps.byte_percentfree (No valid datapoints found) tools.tools-exec-1220.diskspace.root.byte_percentfree (<50.00%)
[23:09:03] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[23:10:32] * Zppix brb