[00:00:34] annika: great, i've removed the erroneous symlink
[00:02:24] madhuvishy: chasemp: thank you for your quick help!
[00:04:03] doctaxon: everything's back now and working
[00:06:28] Having trouble with Wikipedia Library Card platform suddenly. It wasn't on the planned list, but we're getting server errors:
[00:06:29] 1) Failure: https://twl-test.wmflabs.org/oauth/login/?next=/users/test_permission/; 2) Server: https://wikitech.wikimedia.org/wiki/Nova_Resource:Twlight-test.twl.eqiad.wmflabs; 3) Github: https://github.com/thatandromeda/TWLight
[00:06:29] Any reason this would have affected us if we're not on the list anywhere?
[00:06:39] Also hi :)
[00:07:53] Or is the solution just to do a manual restart anyway?
[00:08:44] that error message is not super informative
[00:09:26] but a manual restart is a good first action
[00:09:34] Ocaasi: hello
[00:09:51] possibly whatever service you are running is down and the web server cannot connect to it
[00:09:54] yes there were a bunch of unexpected outcomes that affected other instances
[00:10:11] i second tgr and recommend trying to restart the service too
[00:13:46] ok, thank you. will try that first.
[00:14:58] Ocaasi: i checked on the instance - and it seems like it's missing /home directories - not sure how it got missed on our recovery lists yesterday
[00:15:04] i'm recovering now
[00:15:52] possibly the cause of the service disruption
[00:16:07] thank you!
[00:16:55] Ocaasi: no problem, and sorry for the trouble. will ping you once it gets done
[00:18:01] thanks again. stuff happens!
[00:37:00] 06Labs, 06Operations, 13Patch-For-Review, 07Tracking: Migrate misc to secondary labstore HA cluster - https://phabricator.wikimedia.org/T154336#2954890 (10Ocaasi_WMF) Apparently we were missed on the list and therefore not rebooted, so it's being recovered now. Should fix it most likely. Thanks! -Jake
[00:46:02] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:50:52] !log tools sudo qdel -f 1199218 to force delete a stuck toolschecker job
[00:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:50:57] madhuvishy: ^ that may help
[00:51:14] bd808: i did that too :)
[01:14:42] yuvipanda: ping
[01:14:57] The restore for twlight-test is still going on. I'll be afk for a couple of hours, but respond on phone if anything's up with home directories etc
[01:15:23] madhuvishy: did this issue in any way affect the cyberbot-exec-01 instance?
[01:15:51] Because the file permissions are completely screwed up now.
[01:16:17] CP678|Direct: yeah. apparently the /home restores didn't fix all perms
[01:16:23] My bots are crashing because of a permissions error. It can't access the Peachy framework.
[01:16:33] can you put them back yourself or do you need some help?
[01:16:36] CP678|Direct: looking
[01:16:37] * CP678|Direct is trying to fix them.
[01:17:35] CP678|Direct: yup it was one of the affected instances
[01:17:45] do you have access to your home?
[01:18:19] madhuvishy: I'm working on it, give me a sec. Was cyberbot-exec-iabot-01 also affected?
[01:18:29] Because that's InternetArchiveBot's home.
[01:18:47] CP678|Direct: ah yes
[01:19:18] there might have been some issues there with permissions not being able to be granted to user IABot
[01:19:29] I can fix some of it from my end
[01:19:43] Ugh, permission denied.
[01:23:45] andrewbogott: why do I see a home folder for you in my instance?
[01:24:34] CP678|Direct: you should have access now
[01:24:47] he might have logged in for testing
[01:24:52] CP678|Direct: probably because he ssh'd in to fix puppet or do a kernel patch at some point
[01:25:01] madhuvishy: thanks
[01:25:08] bd808: I see.
[01:25:12] expect that any of the Labs team will access any instance at any time
[01:25:16] I wasn't able to fix the IABot folder
[01:25:19] not sure why
[01:25:34] madhuvishy: I believe IABot is owned by root.
[01:25:40] yes right now
[01:25:48] So I'll have to fix that myself.
[01:26:01] okay
[01:26:13] executable permissions have been lost
[01:26:42] you will probably have to fix that with chmod +x
[01:27:15] madhuvishy: I'm resetting all permissions.
[01:27:16] I've gotta run, let me know if it goes okay! sorry for the trouble
[01:27:22] Just to ensure consistency
[01:27:35] madhuvishy: thanks for your help
[01:37:00] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:40:15] madhuvishy: the restore deleted a bunch of files.
[01:40:40] permissions are no longer an issue but a whole bunch of Peachy Includes are missing.
[01:44:02] I have restored the missing files from my local copy, but really, WTH?
[01:46:17] Oh lovely, half of my bot scripts are missing too.
[01:47:01] I'm surprised the linux OS on that instance is still intact
[01:49:07] Well that was fun
[01:52:50] madhuvishy: when you have the chance can you investigate what happened to my exec node? Literally 60% of my bot files were gone and had to be restored.
[02:42:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:08:00] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[03:37:42] PROBLEM - Puppet run on tools-exec-1411 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[04:17:43] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:38:44] PROBLEM - Puppet run on tools-exec-1411 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[05:15:01] 10Quarry: Explain command forces Quarry to keep running endlessly - https://phabricator.wikimedia.org/T155808#2955224 (10Soni)
[05:18:42] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:13:03] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:39:02] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[06:39:42] PROBLEM - Puppet run on tools-exec-1411 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[07:15:32] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2955290 (10yuvipanda) @akosiaris hmm I'd really like to keep the pin in puppet - there's enough uncertainty as is without having to find docker version mis...
[07:30:07] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2955295 (10yuvipanda) If we have only one version it also means we are tying the prod and tools versions together forever, with upgrades needing to happen a...
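An aside on the `chmod +x` fix discussed at [01:26] above: re-adding execute bits wholesale (`chmod -R +x`) would also mark plain data files executable. A minimal sketch of a more targeted pass, assuming a hypothetical `~/bots` tree (the real Cyberbot layout isn't shown in the log) and using the shebang line as the heuristic for what should be executable:

```python
import os
import stat

ROOT = os.path.expanduser("~/bots")  # hypothetical path; adjust to the real tree

for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            with open(path, "rb") as f:
                is_script = f.read(2) == b"#!"  # shebang => meant to be run directly
        except OSError:
            continue  # unreadable entry; skip rather than guess
        if is_script:
            mode = os.stat(path).st_mode
            # Re-add the execute bits without touching any other mode bits.
            os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Files pulled in as includes (like the Peachy framework mentioned above) don't need the execute bit at all; only entry-point scripts do, which is why a blanket reset "to ensure consistency" is the blunter option.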
[07:44:44] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:32:35] 10PAWS: PAWS: Error loading notebook (Disk I/O error) - https://phabricator.wikimedia.org/T155812#2955358 (10Kenrick95)
[08:42:53] !log wikilabels deploying 15f7a42 to staging
[08:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL
[08:44:02] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:09:59] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[09:16:59] !log wikilabels deploying 15f7a42 to prod
[09:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL
[09:38:29] 06Labs, 10Tool-Labs: Mail from cron regarding a failure of jsub - https://phabricator.wikimedia.org/T155787#2955533 (10MarcoAurelio)
[09:42:46] 06Labs, 10Tool-Labs, 10DBA: Provisioning MySQL replica users fails on tool labs - https://phabricator.wikimedia.org/T151014#2955540 (10Marostegui) @yuvipanda ok to close this ticket?
[09:51:25] 10Tool-Labs-tools-LTA-Knowledgebase: Create password change function - https://phabricator.wikimedia.org/T155675#2955545 (10Legoktm) >>! In T155675#2951700, @Samtar wrote: > @Legoktm not really, I was modelling the account request/login structure off of UTRS - to be honest I don't understand how OAuth would be u...
[10:10:40] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[10:13:47] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955587 (10hashar)
[10:15:03] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:15:43] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955587 (10yuvipanda) Is this running the latest puppet code?
[10:20:20] 06Labs, 10Tool-Labs, 10Tools-Kubernetes: Reassign service/pod IP ranges for kubernetes on tool labs - https://phabricator.wikimedia.org/T152399#2955613 (10yuvipanda) What we need to do: 1. Verify that `192.168.0.0/16` is a good range to use for pod IPs. We currently allocate a /24 out of this to each kubern...
[10:29:06] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955638 (10hashar) The puppetmaster was stalled with 2-3 days of lag and I rebased yesterday just...
[10:31:53] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955641 (10yuvipanda) Yes, there's probably going to be a refactor at some point for that. Doe...
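The sizing question in T152399 ([10:20] above) is plain subnet arithmetic, easy to sanity-check with Python's `ipaddress` module; the /24-per-node allocation is taken from the task, and the capacity numbers simply follow from it:

```python
import ipaddress

pod_space = ipaddress.ip_network("192.168.0.0/16")

# Each kubernetes node is handed a /24 carved out of the pod range.
node_subnets = list(pod_space.subnets(new_prefix=24))

print(len(node_subnets))              # 256 -> the /16 accommodates 256 nodes
print(node_subnets[0].num_addresses)  # 256 addresses available per node's /24
```

So the /16 caps the cluster at 256 nodes; whether that headroom, and the risk of colliding with other private uses of 192.168.0.0/16, is acceptable is exactly what step 1 of the task asks to verify.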
[10:41:00] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[10:51:32] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955655 (10hashar) On `integration-slave-jessie-1001` ``` # /usr/local/sbin/nfs-mount-manager...
[10:59:25] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955675 (10hashar) That matches the hosts failing puppet: ``` # salt --ou...
[11:02:03] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted - https://phabricator.wikimedia.org/T155820#2955685 (10hashar) https://gerrit.wikimedia.org/r/#/c/333230/ cherry pick...
[11:03:53] 06Labs, 10Tool-Labs: Mail from cron regarding a failure of jsub - https://phabricator.wikimedia.org/T155787#2955686 (10zhuyifei1999) a:03zhuyifei1999 The error was probably from the NFS outage, but the error reporting is obviously broken: ``` >>> try: ... f = open('/dev/full', 'w'); f.write('a'); f.close...
[11:15:33] (03PS1) 10Zhuyifei1999: jsub: Change IOError string substitution conversion from '%e' to '%s' [labs/toollabs] - 10https://gerrit.wikimedia.org/r/333232 (https://phabricator.wikimedia.org/T155787)
[11:15:39] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:39:18] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure, 07Beta-Cluster-reproducible, 07Puppet: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#2955738 (10hashar) That is still happening. Happened today when creat...
[11:49:51] 10PAWS: PAWS: Error loading notebook (Disk I/O error) - https://phabricator.wikimedia.org/T155812#2955358 (10Mattho69) Same issue for me
[12:01:03] !log video restarting all currently running v2c workers (celery{1,2}@encoding0{1,2}) and web frontend due to NFS outage, and we aren't very sure which component is erroring (T155803)
[12:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[12:01:06] T155803: ERROR IOError: {Error no 116] Stale file handle - https://phabricator.wikimedia.org/T155803
[12:06:44] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[13:21:53] 10PAWS: PAWS: Error loading notebook (Disk I/O error) - https://phabricator.wikimedia.org/T155812#2955942 (10yuvipanda) can you go to control panel, stop your server and start it again to see if it still persists? We had some NFS issues earlier that should be fixed now...
[13:22:13] 06Labs, 10Tool-Labs, 10DBA: Provisioning MySQL replica users fails on tool labs - https://phabricator.wikimedia.org/T151014#2955943 (10yuvipanda) Yup!
[13:23:45] 06Labs, 10Tool-Labs, 10DBA: Provisioning MySQL replica users fails on tool labs - https://phabricator.wikimedia.org/T151014#2955946 (10Marostegui) 05Open>03Resolved
[13:30:52] 10PAWS: PAWS: Error loading notebook (Disk I/O error) - https://phabricator.wikimedia.org/T155812#2955953 (10Kenrick95) 05Open>03Resolved It works now after stopping and starting the server. Thanks :)
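On the jsub patch at [11:15] above ('%e' to '%s'): `%e` is a floating-point conversion, so interpolating an exception object with it raises a `TypeError` inside the error handler and masks the real failure, which is what the truncated `/dev/full` snippet quoted from T155787 demonstrates. A self-contained sketch of the same failure mode (not the actual jsub code):

```python
# /dev/full is a Linux device that rejects every write with ENOSPC.
try:
    f = open('/dev/full', 'w')
    f.write('a')   # a small write lands in the buffer...
    f.close()      # ...and actually hits the device here, raising IOError
except IOError as err:
    try:
        print('jsub failed: %e' % err)   # old code: TypeError, real error lost
    except TypeError as bug:
        print('error reporting itself broke: %s' % bug)
    print('jsub failed: %s' % err)       # patched code: prints the ENOSPC error
```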
[14:50:21] 06Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2757207 (10Joe) Today I wanted to go around horizon to check and refactor hiera keys before merging https://gerrit.wikimedia.org/r/#/c/332355/. It was a very frustrating experience, and I think it is a good thing to just rep...
[14:50:28] 06Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2956043 (10Joe) p:05Triage>03Unbreak!
[14:51:08] 06Labs, 06Operations, 07Puppet: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2956045 (10Joe)
[15:03:36] zhuyifei1999_: Thanks for T155787
[15:03:37] T155787: Mail from cron regarding a failure of jsub - https://phabricator.wikimedia.org/T155787
[15:03:54] np
[15:04:36] zhuyifei1999_: First Phabricator task and (almost) fixed within hours. I am happy :-D
[15:05:40] it's not fixed though
[15:06:01] it was a temporary NFS outage
[15:06:14] the error reporting broke
[15:06:15] almost - the review and the roll-out are still missing
[15:06:43] and the patch only addresses the error reporting
[15:07:27] No problem with the error itself. One task starts every 10 minutes, so it does not matter, and I ran the other one by hand ~1 hour later and it worked.
[15:36:31] !log tools disable puppet everywhere to cherrypick patch moving base to a profile
[15:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:37:17] joe: done
[15:37:31] <_joe_> cool
[15:37:39] <_joe_> cherrypicking now
[15:39:06] <_joe_> running puppet on the puppetmaster first
[15:39:18] <_joe_> let's see if I did everything right :)
[15:39:19] <_joe_> uhm
[15:39:31] <_joe_> +# Don't allow people to forward their agents either.
[15:39:32] <_joe_> +AllowAgentForwarding no
[15:39:37] <_joe_> where is this configured
[15:39:50] joe: labs.yaml
[15:39:53] joe: in ops/puppet
[15:40:04] <_joe_> ahhh and of course since it's a single setting
[15:40:08] <_joe_> it's overridden
[15:40:11] <_joe_> ok, right
[15:40:13] <_joe_> meh
[15:40:15] <_joe_> fixing it
[15:40:41] PROBLEM - Puppet run on tools-exec-1411 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:41:27] not sure where ^ is from, I see that it's disabled cross fleet
[15:41:46] <_joe_> yuvipanda: I'll look in a few
[15:42:07] _joe_: no, it's strange because puppet shouldn't be running there at all
[15:42:23] <_joe_> yuvipanda: it might have been running while you disabled it
[15:42:32] joe: ah, good call
[15:42:45] <_joe_> yuvipanda: what's the cache time of the mwyaml backend?
[15:42:48] <_joe_> 1 minute?
[15:42:59] joe: checking...
[15:43:08] damn, my emacs locked up again
[15:44:56] joe: yes 60
[15:44:56] s
[15:45:09] <_joe_> yuvipanda: ok I think it's properly a noop now
[15:45:24] joe: ok! wanna test on another host too?
[15:45:34] yuvipanda: when you have time, could you look at https://gerrit.wikimedia.org/r/#/c/333232/ ?
[15:45:40] <_joe_> I'm doing it
[15:45:58] zhuyifei1999_: will do!
[15:46:01] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2879938 (10MoritzMuehlenhoff) But reprepro somewhat supports multiple versions as long as they're stored in different sections (or whatever the exact termin...
[15:47:02] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2956102 (10yuvipanda) We'll maybe have to create a 'labs' section in reprepro and use it?
[15:47:31] <_joe_> yuvipanda: running now on tools-exec-1411 specifically
[15:47:55] joe: ok!
[15:48:20] <_joe_> noop!
[15:48:28] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2956105 (10MoritzMuehlenhoff) That or maybe "staging" to make it a little more generic.
[15:49:58] joe: ok, shall I start enabling puppet group by group?
[15:50:08] wanna test it on, say, bastion first (it has more ssh stuff)
[15:54:34] <_joe_> yuvipanda: yep, definitely
[15:54:39] <_joe_> more ssh stuff? where?
[15:54:44] <_joe_> I couldn't find it
[15:54:46] _joe_: noop! I'm going to test the grid master now, and then just enable it
[15:54:54] <_joe_> test the bastions
[15:55:02] <_joe_> if they have specific ssh things
[15:55:05] joe: yep, did. was a noop
[15:55:20] _joe_: nope, they didn't! I must've remembered wrong
[15:55:32] <_joe_> ok cool
[15:55:41] <_joe_> ssh is the only thing that could be problematic
[15:58:18] joe: ok seems good. going to deploy across now
[15:58:40] !log tools enabling puppet across all hosts
[15:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:00:40] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:20:38] 06Labs: labstore1004 is high load and periodic unavailability to icinga - https://phabricator.wikimedia.org/T155832#2956182 (10Paladox)
[17:00:04] doing a revision import via the API gives me the response: import {{ns 0 title {D-Day Dodgers} revisions 123}} <- but no revisions were actually imported to dewiki. I notice this phenomenon more and more frequently. The next import job then works fine again. Is there anything wrong with the import API?
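One way to catch the symptom doctaxon describes above (the import module answering `revisions 123` while nothing shows up on dewiki) is to verify against the wiki instead of trusting the import response. A sketch using the standard `action=query` API; this only detects the mismatch, it doesn't explain the import module's behavior:

```python
import requests

API = "https://de.wikipedia.org/w/api.php"
TITLE = "D-Day Dodgers"  # the page from the report above

# Ask dewiki itself for the page's newest revision.
r = requests.get(API, params={
    "action": "query",
    "titles": TITLE,
    "prop": "revisions",
    "rvprop": "ids|timestamp",
    "rvlimit": 1,
    "format": "json",
    "formatversion": 2,
}, headers={"User-Agent": "import-check-sketch/0.1"}).json()

page = r["query"]["pages"][0]
if page.get("missing"):
    print("import reported revisions, but the page does not exist")
else:
    print("newest revision on-wiki:", page["revisions"][0])
```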
[17:46:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:14:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:14:17] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:16:55] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[18:18:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:18:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:22:08] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[18:22:54] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[18:23:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:23:52] 10Tool-Labs-tools-LTA-Knowledgebase: Migrate to OAuth - https://phabricator.wikimedia.org/T155841#2956430 (10Samtar)
[18:24:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:24:18] hmmm
[18:24:19] not sure what's going on
[18:25:22] PROBLEM - Puppet run on tools-exec-1219 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[18:26:04] PROBLEM - Puppet run on tools-exec-1218 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:27:07] yuvipanda: can you give the wikimedia api a nudge before you punch out? It's been 80 minutes now with no update :(
[18:28:33] PROBLEM - Puppet run on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:29:20] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[18:29:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:33:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:37:01] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:37:47] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure, 07Beta-Cluster-reproducible, 07Puppet: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#2736876 (10scfc) (T152941 is slightly related, but refers to the case...
[18:39:09] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:43:38] yuvipanda: you're very familiar with Meteor, right?
[18:53:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:54:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:56:52] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[18:57:08] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:57:52] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:03:09] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:03:31] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:04:37] RECOVERY - Puppet run on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:05:21] RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:06:03] RECOVERY - Puppet run on tools-exec-1218 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:19:09] PROBLEM - Free space - all mounts on tools-exec-1221 is CRITICAL: CRITICAL: tools.tools-exec-1221.diskspace._public_dumps.byte_percentfree (No valid datapoints found) tools.tools-exec-1221.diskspace.root.byte_percentfree (<11.11%)
[19:36:23] (03PS1) 10Andrew Bogott: Update novaobserver passwd [labs/private] - 10https://gerrit.wikimedia.org/r/333290
[19:38:19] (03CR) 10Alex Monk: [V: 032 C: 032] Update novaobserver passwd [labs/private] - 10https://gerrit.wikimedia.org/r/333290 (owner: 10Andrew Bogott)
[19:43:51] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add fake clushuser keypair [labs/private] - 10https://gerrit.wikimedia.org/r/325050 (owner: 10Merlijn van Deen)
[19:44:08] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add tools hiera common.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/325041 (owner: 10Merlijn van Deen)
[20:13:54] Is the unique user id in the oauth return value the "exp", "sub", "iat" or "aud"? Where can I look up these values?
[20:33:26] tobias47n9e: https://www.mediawiki.org/wiki/Extension:OAuth#Identify_the_User_.28optional.29
[20:33:44] looks like "sub" is the user_id
[20:35:00] bd808: Thanks, didn't read that page before.
[20:35:41] there is some good stuff there and in the sub pages. Not all of our docs are horrible :)
[20:36:13] bd808: By the way, can someone from the mediawiki technical team review my Django oauth backend? It would be nice to know if it handles security and privacy correctly
[20:37:22] I'd be glad to take a look. where is your source?
[20:39:24] bd808: I can push the branch in a few hours and ping you. Doesn't have to be reviewed today, but next week would be great.
[20:42:02] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:56:32] tobias47n9e: a phab task assigned to me would help. I'm juggling a lot of stuff right now but I can make some time to help out :)
[20:59:19] 06Labs: labstore1004 is high load and periodic unavailability to icinga - https://phabricator.wikimedia.org/T155832#2956817 (10chasemp)
[21:00:48] 06Labs: labstore1004 is high load and periodic unavailability to icinga - https://phabricator.wikimedia.org/T155832#2956151 (10chasemp) Working from the rp_filter confusion theory and the ongoing icinga issues (somewhere between 2-4 host unavailable warnings per hour over the previous 8 hours at least) I shutdow...
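To pin down the claim names tobias47n9e asks about at [20:13] above: the /identify response documented on the page bd808 links is a JWT whose `sub` claim carries the central user id, while `exp`, `iat` and `aud` are the standard expiry, issued-at and audience fields. A sketch of decoding it, assuming PyJWT and the HS256 signature keyed on the consumer secret that the documentation describes:

```python
import jwt  # PyJWT

CONSUMER_KEY = "..."     # from the registered OAuth consumer
CONSUMER_SECRET = "..."  # the /identify JWT is signed with this

def identify_user(identify_jwt: str) -> dict:
    """Decode and verify the JWT returned by MediaWiki's OAuth /identify."""
    return jwt.decode(
        identify_jwt,
        CONSUMER_SECRET,
        algorithms=["HS256"],
        audience=CONSUMER_KEY,  # rejects tokens minted for another consumer
    )  # the signature and the 'exp' claim are checked by PyJWT itself

# claims = identify_user(raw_jwt)
# claims["sub"] -> stable central user id; claims["iat"] -> issue time.
# The documented payload also carries fields like "username".
```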
[21:08:02] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[21:25:22] 10Tool-Labs-tools-LTA-Knowledgebase: Add password confirmation box - https://phabricator.wikimedia.org/T155854#2956884 (10Mike1901)
[21:53:03] 10Tool-Labs-tools-LTA-Knowledgebase: Add password confirmation box - https://phabricator.wikimedia.org/T155854#2957007 (10Samtar) 05Open>03Resolved
[22:18:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:30:14] 10Tool-Labs-tools-LTA-Knowledgebase: Require confirmation diff on account request - https://phabricator.wikimedia.org/T155704#2957111 (10Samtar) 05Open>03Resolved
[22:40:57] PROBLEM - Free space - all mounts on tools-exec-1220 is CRITICAL: CRITICAL: tools.tools-exec-1220.diskspace._public_dumps.byte_percentfree (No valid datapoints found) tools.tools-exec-1220.diskspace.root.byte_percentfree (<50.00%)
[23:09:03] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[23:10:32] * Zppix brb