[00:02:36] <jdlrobson>	 YuviPanda: i can't seem to delete a resource https://wikitech.wikimedia.org/wiki/Nova_Resource:Mf-browser-tests.mobile-smoketests.eqiad.wmflabs
[00:02:49] <jdlrobson>	 Just tells me 'The requested host does not exist.'
[00:07:01] <Katie>	 jynus: Sweet! I guess only the templatelinks table is affected?
[00:08:05] <YuviPanda>	 jdlrobson: log out and back in
[00:38:49] <Luke081515>	 Hello, the beta cluster is not reachable for me now, can somebody help me?
[01:17:57] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 44.44% of data above the critical threshold [0.0]
[02:12:47] <wikibugs>	 6Labs, 3Labs-Sprint-107: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1490483 (10yuvipanda) I'm going to count this as done unless @andrewbogott objects.
[02:41:39] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: Enable OpenJDK 8 - https://phabricator.wikimedia.org/T68171#1490519 (10MZMcBride)
[03:52:57] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0]
[04:48:42] <shinken-wm>	 PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1404 is CRITICAL tools.tools-webgrid-lighttpd-1404.diskspace.root.byte_percentfree (<40.00%)
[05:28:32] <shinken-wm>	 PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1406 is CRITICAL tools.tools-webgrid-lighttpd-1406.diskspace.root.byte_percentfree (<22.22%)
[05:56:40] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 14.29% of data above the critical threshold [0.0]
[06:34:52] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1403 is CRITICAL 40.00% of data above the critical threshold [0.0]
[06:36:40] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0]
[06:43:31] <shinken-wm>	 RECOVERY - Free space - all mounts on tools-webgrid-lighttpd-1406 is OK All targets OK
[06:43:41] <shinken-wm>	 RECOVERY - Free space - all mounts on tools-webgrid-lighttpd-1404 is OK All targets OK
[06:48:13] <shinken-wm>	 PROBLEM - Puppet failure on tools-submit is CRITICAL 44.44% of data above the critical threshold [0.0]
[06:49:50] <Tuttsie>	 hi there. https://tools.wmflabs.org/geohack/geohack.php is down... can you restart the service?
[06:57:29] <YuviPanda>	 Tuttsie: done
[06:59:02] <Tuttsie>	 thanks, but still not working. e.g. https://tools.wmflabs.org/geohack/geohack.php?params=51_12_N_6_42_E
[07:03:06] <YuviPanda>	 Tuttsie: temporary workaround in place...
[07:06:16] <Tuttsie>	 ok, running again. thank you!
[07:07:19] <wikibugs>	 6Labs, 10Tool-Labs: GeoHack failing in Trusty because it can not allocate enough memory? - https://phabricator.wikimedia.org/T107253#1490787 (10yuvipanda) 3NEW
[07:07:31] <YuviPanda>	 Tuttsie: ^ I have filed a bug to note the workaround
[07:09:55] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1403 is OK Less than 1.00% above the threshold [0.0]
[07:23:18] <shinken-wm>	 RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0]
[07:31:25] <wikibugs>	 6Labs, 10Tool-Labs: GeoHack failing in Trusty because it can not allocate enough memory? - https://phabricator.wikimedia.org/T107253#1490820 (10Legoktm) Do you mean you moved it to precise? Or...?
[07:58:19] <wikibugs>	 6Labs, 10Labs-Infrastructure, 5Continuous-Integration-Isolation: Investigate non blocking fs resizing when instance is booted - https://phabricator.wikimedia.org/T104974#1433499 (10hashar)
[08:36:43] <wikibugs>	 10Tool-Labs-tools-Other: Geohack should be mobile friendly - https://phabricator.wikimedia.org/T103409#1490862 (10Thgoiter) Maybe replacing Geohack with Extension:MapSources could solve this?  >>! In T102960#1378990, @scfc wrote: > I think it is more useful to replace it with [[https://www.mediawiki.org/wiki/Ext...
[08:55:12] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: support python3 uwsgi apps - https://phabricator.wikimedia.org/T104374#1490882 (10GoldenRing) Mmmmkay.  I've put this in ~/uwsgi.ini:  ''' [uwsgi] socket = /tmp/movestats.sock chdir = /data/project/movestats/www/python/src pythonpath = /data/project/movestats/www/python...
[09:25:49] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: support python3 uwsgi apps - https://phabricator.wikimedia.org/T104374#1490971 (10GoldenRing) On a little more investigation, it seems that `--venv` does work if `--plugin python3` (or `--plugin python`) is also specified.  Putting `plugin = python3` in the ini file doe...
[09:27:49] <GoldenRing>	 YuviPanda: are you about?
[09:30:01] <GoldenRing>	 Or anyone else who'd be happy to have a look at the uwsgi-plain support added yesterday?
[10:42:18] <jem>	 Lydia_WMDE: around?
[11:01:03] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: support python3 uwsgi apps - https://phabricator.wikimedia.org/T104374#1491214 (10GoldenRing) Or, in fact, even just reversing the order in which the `--venv` and `--ini` arguments are added to the args list would be good enough.  If `--ini` comes first, then you can sp...
[11:01:49] <GoldenRing>	 Is there anyone about who'd be willing to patch the uwsgi-plain support added yesterday?
[11:47:56] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 37.50% of data above the critical threshold [0.0]
[12:22:55] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0]
[12:29:39] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: support python3 uwsgi apps - https://phabricator.wikimedia.org/T104374#1491353 (10valhallasw) I think --venv shouldn't be passed at all by default, and instead should be provided in the .ini file.
[12:31:25] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: support python3 uwsgi apps - https://phabricator.wikimedia.org/T104374#1491354 (10GoldenRing) Yes, I guess that'd also work!
[12:48:55] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 55.56% of data above the critical threshold [0.0]
[13:23:54] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0]
[13:40:06] <Luke081515>	 Hello, login at the beta cluster seems to be broken, can somebody help?
[13:42:03] <JohnFLewis>	 Luke081515: can you give some more information? what sort of errors, which page(s)?
[13:43:26] <Luke081515>	 The beta cluster works, but when I try to login there is this screen: Wikimedia Foundation Error
[15:13:57] <andrewbogott>	 Luke081515: thanks for the bug report; folks in another channel are working on this.
[15:16:11] <Luke081515>	 ok, thank you
[15:45:26] <Jeph_paul>	 I have a cluebot related question, I tried #cluebotng, but I wasn't able to join. Is cluebot deployed on wikis other than en?
[15:46:00] <Jeph_paul>	 If so where can I find the list of languages it has been working on?
[15:51:19] <wikibugs>	 6Labs, 10Tool-Labs: shinken is too "volatile" and imprecise to be of use - https://phabricator.wikimedia.org/T107297#1491854 (10scfc) 3NEW
[16:43:27] <grrrit-wm>	 (03PS1) 10Ejegg: Filter WMDE- out of #wikimedia-fundraising [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/227746 
[17:15:57] <wikibugs>	 6Labs, 3Labs-Sprint-107: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1492200 (10Andrew) I assume by 'services' we mean 'service that labs users will notice' right?  In that case this list looks correct... it doesn't include e.g. keystone or nova api, but those are maybe out of s...
[17:17:09] <wikibugs>	 6Labs, 3Labs-Sprint-107: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1492201 (10yuvipanda) Indeed, that is correct.
[17:44:25] <ostriches>	 Bleh, wonder what's causing this? https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=deployment-prep&instanceid=9596c826-036f-47d8-8825-2fa4fa035422&region=eqiad
[17:45:08] <ostriches>	 Could not request certificate: Connection refused - connect(2) for "" port 8140
[17:47:46] <ostriches>	 YuviPanda: ? ^
[17:48:13] <YuviPanda>	 andrewbogott: ^ this again. the puppetmaster name is not set...
[17:48:24] <YuviPanda>	 ostriches: are you creating an instance with a same name as an instance that used to exist?
[17:48:27] <ostriches>	 Yep
[17:48:36] <YuviPanda>	 yeah, that seems to cause it
[17:48:47] <YuviPanda>	 ostriches: current workaround is to... not do that 
[17:48:54] <YuviPanda>	 let me file a bug to make sure
[17:48:56] <ostriches>	 Stupid workaround :(
[17:49:47] <wikibugs>	 6Labs: Creating an instance with same name as deleted instance fails - https://phabricator.wikimedia.org/T107325#1492299 (10yuvipanda) 3NEW
[17:49:51] <YuviPanda>	 ostriches: indeed
[17:53:06] <GoldenRing>	 YuviPanda: Any chance you could review 227690 today, please?
[17:54:09] <YuviPanda>	 GoldenRing: looking
[17:56:23] <YuviPanda>	 GoldenRing: done
[17:56:30] <YuviPanda>	 GoldenRing: it'll run on all hosts over the next 20mins
[17:56:39] <GoldenRing>	 Many thanks.
[17:56:43] <YuviPanda>	 yw
[17:56:50] <YuviPanda>	 sorry I missed that bit in my earlier patch
[17:58:15] <andrewbogott>	 ostriches: I think it’s more along the lines of “don’t do that for 5 minutes or so”
[17:58:30] <andrewbogott>	 But I’ll see if I can tighten that up
[17:58:49] <wikibugs>	 6Labs: Creating an instance with same name as deleted instance fails - https://phabricator.wikimedia.org/T107325#1492332 (10Andrew) a:3Andrew
[18:09:30] <wikibugs>	 6Labs, 10Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1492365 (10scfc) I don't see any remaining "UNKNOWN"s at http://shinken.wmflabs.org/problems for #Tool-Labs, so I'm closing this task here with the assumption that the underlying cause cannot be investigated at the moment.
[18:09:48] <wikibugs>	 6Labs, 10Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1492369 (10scfc) 5Open>3Resolved a:3valhallasw
[18:22:17] <wikibugs>	 6Labs, 10Tool-Labs: croptool creates huge temporary files - https://phabricator.wikimedia.org/T107328#1492388 (10scfc) 3NEW
[18:23:21] <wikibugs>	 6Labs, 10Tool-Labs: croptool creates huge temporary files - https://phabricator.wikimedia.org/T107328#1492399 (10scfc) p:5Triage>3Lowest
[18:25:45] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1401 is CRITICAL 20.00% of data above the critical threshold [0.0]
[18:28:04] <wikibugs>	 6Labs, 10Tool-Labs, 3Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1492411 (10Andrew) unmounting and remounting nfs mount points seems to fix the issue, at least temporarily.  Now I'm trying to see if it's any one particular mount that is...
[18:34:15] <shinken-wm>	 PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL 33.33% of data above the critical threshold [0.0]
[18:35:46] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1401 is OK Less than 1.00% above the threshold [0.0]
[18:43:48] <andrewbogott>	 Coren: before I (re)-schedule a round of reboots, can you review valhalla’s work on https://phabricator.wikimedia.org/T107052 and see if you have any alternative ideas?
[18:44:11] <Coren>	 andrewbogott: kk, doing so now.
[18:44:59] <Niharika>	 Hi, I'm trying to ssh into https://wikitech.wikimedia.org/wiki/Nova_Resource:Grantreview instance but I accidentally deleted its SSH keys from my known_hosts, so it doesn't let me ssh in now. 
[18:45:58] <Coren>	 andrewbogott: Oh, that.  Yeah, I've spent a couple of hours trying to track this down last evening - as far as I can tell, those are completely lost locks that are still being held on by the kernel - possibly as a consequence of NFS restarts when we split the filesystems in three.
[18:46:14] <andrewbogott>	 Niharika: that shouldn't stop you, it’ll just prompt you to confirm.
[18:46:27] <Niharika>	 I get the error: "channel 0: open failed: administratively prohibited: open failed ssh_exchange_identification: Connection closed by remote host"
[18:46:49] <Niharika>	 That's what I thought would happen. 
[18:46:50] <Coren>	 andrewbogott: As far as I can tell, there is no way to instruct the kernel to drop those short of a complete unmount/remount.
[18:47:02] <andrewbogott>	 Coren: ok, rebooting it is then
[18:47:37] * Coren adds that to the ticket, actually, to document.
[18:47:54] <andrewbogott>	 Niharika: can you tell me what exact ssh commandline you are using?
[18:47:58] <andrewbogott>	 And, has this worked in the past?
[18:48:08] <andrewbogott>	 (I can ssh in to grantreview-dev.grantreview.eqiad.wmflabs just fine.)
[18:48:29] <Niharika>	 andrewbogott: I'm using "ssh grantsreview-dev.eqiad.wmflabs" and yes it's worked before.
[18:48:35] <Niharika>	 Oh!
[18:48:40] <Niharika>	 Spelling mistake. Sorry. 
[18:48:45] <andrewbogott>	 np :)
[18:49:09] <Niharika>	 Thanks!
[18:49:13] <shinken-wm>	 RECOVERY - Puppet failure on tools-bastion-01 is OK Less than 1.00% above the threshold [0.0]
[18:50:17] <Luke081515>	 andrewbogott: Login at beta works now :)
[18:50:44] <wikibugs>	 6Labs, 10Tool-Labs, 3Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1492479 (10coren) My inspection of the packets coupled with reading the source seem to point to the kernel holding advisory locks over files on NFS, but the NFS server hav...
[18:50:45] <andrewbogott>	 Luke081515: great!
[18:50:57] <andrewbogott>	 Luke081515: will you update your bug accordingly, if someone else has not already?
[18:51:08] <Luke081515>	 I will do it now
[18:51:34] <Luke081515>	 done
[18:56:46] <andrewbogott>	 Luke081515: thanks!
[18:58:21] <Luke081515>	 no problem, thank you too
[18:58:22] <andrewbogott>	 Coren: ok, I scheduled a bastion reboot.  I think I can fix the webgrid nodes without restarts.
[19:13:31] <bd808>	 !log reading-web-staging Added Bmansurov as a project admin
[19:14:47] <wikibugs>	 6Labs, 10Wikimedia-Labs-General, 6operations, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1492509 (10jcrespo)
[19:14:49] <wikibugs>	 6Labs, 10Tool-Labs, 7Database: Tool Labs enwiki_p replicated database missing rows - https://phabricator.wikimedia.org/T106470#1492505 (10jcrespo) 5Open>3Resolved s1-master:   ``` mysql> SELECT     -> tl_title     -> FROM page     -> JOIN templatelinks     -> ON tl_from = page_id     -> WHERE page_namesp...
[19:23:24] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL 20.00% of data above the critical threshold [0.0]
[19:24:06] <shinken-wm>	 PROBLEM - Puppet failure on tools-checker-01 is CRITICAL 44.44% of data above the critical threshold [0.0]
[19:24:14] <shinken-wm>	 PROBLEM - Puppet failure on tools-submit is CRITICAL 22.22% of data above the critical threshold [0.0]
[19:24:18] <shinken-wm>	 PROBLEM - Puppet failure on tools-services-02 is CRITICAL 11.11% of data above the critical threshold [0.0]
[19:25:48] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 10.00% of data above the critical threshold [0.0]
[19:26:57] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL 20.00% of data above the critical threshold [0.0]
[19:30:04] <andrewbogott>	 ^ Those are me, nothing to worry about
[19:33:07] <Coren>	 YuviPanda: It looks like all is well.  Next step: set timers for cleanup and replicate?
[19:33:23] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-catscan is OK Less than 1.00% above the threshold [0.0]
[19:34:06] <YuviPanda>	 Coren: hmm, so are there any negatives to having the cleanup run very very often? like every minute?
[19:34:13] <shinken-wm>	 RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0]
[19:34:21] <shinken-wm>	 RECOVERY - Puppet failure on tools-services-02 is OK Less than 1.00% above the threshold [0.0]
[19:34:57] <Coren>	 YuviPanda: Yes, lvm operations (especially lvs, vgs) are I/O killers as they require flushing all devices for their scans.
[19:35:07] <YuviPanda>	 Coren: the reason I ask is that then they can become daemons that do a sleep 60 and then we have a super easy way of monitoring - let it crash and then monitor for process
[19:35:07] <Coren>	 YuviPanda: They create write barriers.
[19:35:11] <YuviPanda>	 ah I see
[19:35:13] <YuviPanda>	 hmm
[19:35:20] <YuviPanda>	 how are we going to monitor these if they're on timers?
[19:35:45] <Coren>	 YuviPanda: Hm.  Simple check of timestamp-of-last-run?
[19:35:47] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[19:36:02] <YuviPanda>	 Coren: hmm, we'll have to maintain that state somewhere now
[19:36:17] <Coren>	 YuviPanda: Doesn't systemd already do this?
[19:36:59] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1410 is OK Less than 1.00% above the threshold [0.0]
[19:37:17] <YuviPanda>	 Coren: oh, does it?
[19:37:31] <Coren>	 YuviPanda: I'm pretty sure it does.  Lemme rtfm
[19:38:47] <jynus>	 I thought several times about labsdb1004, and I think I can set it up without downtime
[19:39:08] <jynus>	 it will be painful and slow, but it may be done
[19:39:36] <Coren>	 jynus: Slow is not an issue, I think.  What about risky?
[19:40:14] <Coren>	 YuviPanda: It's easy to find the last run, lemme see if we can determine with certitude whether it succeeded or failed.
[19:40:21] <YuviPanda>	 jynus: remember it's running debian jessie now, with different mysql package versions
[19:41:48] <jynus>	 we will go for production
[19:41:52] <jynus>	 versions
[19:42:04] <jynus>	 it works on jessie
[19:42:24] <jynus>	 it doesn't matter anyway
[19:42:29] <Coren>	 YuviPanda: Yep; a failed run sets the state, and we can check for that with 'systemctl is-failed <unit>'
[19:42:48] <jynus>	 Coren, not risky, but it may take some time and not work in the end
[19:42:51] <YuviPanda>	 jynus: sweet.
[19:43:03] <jynus>	 that is your call
[19:43:17] <YuviPanda>	 Coren: sweet. so I'd say next step is to write that monitoring, verify that it triggers when we don't run it manually, and then set the timers
[19:43:25] <jynus>	 i prefer production because it is what I do 95% of the time
[19:43:43] <jynus>	 but the setup is independent of that
[19:43:49] <Coren>	 jynus: IMO, it is worth a try - the worst thing that can happen is that it doesn't work and we do have to have a downtime.
[19:44:05] <jynus>	 that is exactly my thought
[19:45:18] <wikibugs>	 6Labs, 10Tool-Labs, 7Database, 3Labs-Q4-Sprint-1, and 4 others: Make sure tools-db is replicated somewhere - https://phabricator.wikimedia.org/T88718#1492577 (10jcrespo) I can try something without downtime, but I need the gerrit:218874 patch applied. Please +1 to apply it.
[19:46:04] <jynus>	 I will be busy now, but if you have the time, I can try it tomorrow in my morning
[19:59:08] <shinken-wm>	 RECOVERY - Puppet failure on tools-checker-01 is OK Less than 1.00% above the threshold [0.0]
[20:01:12] <wikibugs>	 6Labs, 3Labs-Sprint-107, 5Patch-For-Review: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1492611 (10yuvipanda) So remaining steps are:  [ ] Find a way to monitor script failure [ ] Find a way to monitor script hasn't run in X hours [ ] Make sure that the previous t...
[20:01:38] <wikibugs>	 6Labs, 3Labs-Sprint-107, 5Patch-For-Review: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1492612 (10yuvipanda) @Coren says we can find out if the script failed or succeeded and the time from systemd itself. Now to write an nrpe check for it...
[20:01:43] <YuviPanda>	 Coren: ^ updated
[20:02:47] <Coren>	 YuviPanda: WARNING when not run in X hours, CRITICAL when last run is fail?
[20:03:17] <YuviPanda>	 Coren: no, I think they should be two checks, similar to how we have puppet
[20:03:41] <YuviPanda>	 Coren: failure should alert immediately, warning/critical when not run in X hours
[20:04:05] <Coren>	 YuviPanda: That doesn't need two checks.
[20:04:07] <andrewbogott>	 Coren or YuviPanda, can you drain tools-webgrid-lighttpd-1404, or tell me how to drain it?  And repool 1401, I think it’s healthy again.
[20:04:42] <Coren>	 YuviPanda: Both tests can be made in one check; it's easy to warn at X and critical at X+Y /or/ last run failed.
[20:04:47] <YuviPanda>	 andrewbogott: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin
[20:05:01] <andrewbogott>	 thanks :)
[20:05:14] <YuviPanda>	 Coren: it's more clear if it's two checks, IMO. let's not override warn / critical to mean other things.
[20:06:02] <Coren>	 YuviPanda: ... I'm not seeing the override, tbh.  OK means all is well, warn means pay attention, critical means something broke fix it now.  :-)
[20:06:37] <Coren>	 YuviPanda: And I think it's clearer to have 1:1 unit<->check, but meh.  Two tests is not clearer imo, but doesn't harm.
[20:07:00] <YuviPanda>	 yep, so let's do two checks :)
[20:07:13] <YuviPanda>	 they are checking for different things - 1. has this failed? and 2. has this run recently enough?
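The two checks discussed above (has the unit failed, and has it run recently enough) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the nrpe check that was actually written: the function names, the thresholds, and how the age of the last run is obtained are all hypothetical. The only real command used is `systemctl is-failed`, and the return values follow the usual Nagios plugin exit-code convention.

```python
import subprocess

# Conventional Nagios/NRPE plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_failed(is_failed):
    """Check 1: alert immediately if the last run of the unit failed."""
    return CRITICAL if is_failed else OK


def check_freshness(age_seconds, warn_after, crit_after):
    """Check 2: warn, then go critical, when the unit has not run recently.

    age_seconds is the time since the last run; how it is measured is
    left open here (assumption: the caller supplies it).
    """
    if age_seconds >= crit_after:
        return CRITICAL
    if age_seconds >= warn_after:
        return WARNING
    return OK


def unit_is_failed(unit):
    """Ask systemd whether a unit is in the failed state.

    `systemctl is-failed` exits with status 0 when the unit has failed.
    """
    result = subprocess.run(
        ["systemctl", "is-failed", "--quiet", unit],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Keeping the two decisions in separate functions mirrors the "two checks" position: a failure maps straight to CRITICAL, while staleness has its own warn and critical thresholds.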
[20:11:05] <andrewbogott>	 !log tools rebooting tools-webgrid-lighttpd-1404
[20:11:11] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[20:13:08] <YuviPanda>	 brb
[20:14:40] <wikibugs>	 6Labs, 10Tool-Labs, 3Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1492652 (10Andrew) I have done an unmount/remount on those tools nodes where it was possible.  I've also rebooted tools-webgrid-lighttpd-1404 and tools-webgrid-lighttpd-14...
[20:36:35] <ostriches>	 andrewbogott: I waited like an hour, no dice. I'll just not do it at all for now :p
[20:37:13] <andrewbogott>	 ostriches: weird.  I’ll have a look now.  Can you remind me what the instance name/project is?
[20:39:15] <ostriches>	 deployment-prep, deployment-cache-bits04. I did just nuke it again out of frustration...
[20:39:31] <ostriches>	 The other deployment-cache-* boxen are ok
[20:39:37] <YuviPanda>	 andrewbogott: openocr instance in openocr project has similar issues
[20:40:21] <YuviPanda>	 @invite #wikimedia-ai
[20:40:24] <YuviPanda>	 boo
[20:40:25] <YuviPanda>	 wm-bot: help
[20:40:25] <wm-bot>	 Hi YuviPanda, there is some error, I am a stupid bot and I am not intelligent enough to hold a conversation with you :-)
[20:40:33] <YuviPanda>	 wm-bot: @invite #wikimedia-ai
[20:41:04] <wm-bot>	 Attempting to join #wikimedia-ai using wm-bot4
[20:41:05] <YuviPanda>	 @add #wikimedia-ai
[20:41:22] <andrewbogott>	 ostriches: trusty?
[20:42:22] <ostriches>	 andrewbogott: New instances are all jessie
[20:42:34] <andrewbogott>	 ostriches: also, can you give me an estimate of when you think the original instance with that name was created?
[20:42:45] <ostriches>	 Earlier today? Couple of hours ago?
[20:43:37] <andrewbogott>	 ok, that’s specific enough
[20:43:39] <ostriches>	 andrewbogott: 17:30, 29 July 2015
[20:43:43] <ostriches>	 Was first attempt
[20:43:50] <ostriches>	 https://wikitech.wikimedia.org/wiki/Special:Undelete/Nova_Resource:Deployment-cache-bits04.deployment-prep.eqiad.wmflabs
[20:43:51] <andrewbogott>	 too specific!  Dial it back!
[20:43:54] <andrewbogott>	 :)
[21:43:44] <wikibugs>	 6Labs: Puppet run fails on tools-webgrid-lighttpd-1408 with "Failed to determined $::labsproject" - https://phabricator.wikimedia.org/T107350#1492912 (10scfc) 3NEW
[21:51:54] <bd808>	 marxarelli: https://wikitech.wikimedia.org/wiki/Help:MediaWiki-Vagrant_in_Labs
[21:52:01] <bd808>	 edits and testing welcome
[21:53:06] <bd808>	 YuviPanda: ^
[21:54:39] <YuviPanda>	 bd808: nice!
[22:00:36] <marxarelli>	 bd808: awww yeah
[22:04:22] <wikibugs>	 6Labs: Puppet run fails on tools-webgrid-lighttpd-1408 with "Failed to determined $::labsproject" - https://phabricator.wikimedia.org/T107350#1492977 (10scfc) p:5Triage>3High According to http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Virtualization%20cluster%20eqiad&h=labcontrol1001.wikimedia.o...
[22:10:39] <wikibugs>	 6Labs: Puppet run fails on tools-webgrid-lighttpd-1408 with "Failed to determined $::labsproject" - https://phabricator.wikimedia.org/T107350#1492998 (10scfc) (Semi-related side note: Labs instances run Puppet twice per hour.  I don't remember if there was ever a requirement for that, i. e. a situation where som...
[22:12:02] <andrewbogott>	 Coren: If you’re not eating dinner, can I get a little dns help?
[22:15:50] <wikibugs>	 6Labs: Creating an instance with same name as deleted instance fails - https://phabricator.wikimedia.org/T107325#1493014 (10Andrew) 1)  Waiting for ages between delete and recreation seems to help.  2) The failing instances are failing because 'hostname -d' is throwing an error.  This happens before the firstboo...
[22:34:59] <wikibugs>	 10MediaWiki-extensions-OpenStackManager: OpenStackManager randomly lost my public key - https://phabricator.wikimedia.org/T107362#1493135 (10ori) 3NEW
[23:11:38] <wikibugs>	 6Labs, 3Labs-Sprint-107: Identify user facing services labs provides  - https://phabricator.wikimedia.org/T105721#1493288 (10yuvipanda)
[23:22:48] <wikibugs>	 6Labs, 5Patch-For-Review: Creating an instance with same name as deleted instance fails - https://phabricator.wikimedia.org/T107325#1493327 (10Andrew) 5Open>3Resolved This looks to be fixed by https://gerrit.wikimedia.org/r/#/c/227905/.
[23:26:23] <andrewbogott>	 ostriches: ok, I think I have fixed the issue.  Previously the magic spell was:  delete, and wait one hour before recreating.  Now the magic spell is (I think) delete and wait 10 seconds.
[23:26:45] <YuviPanda>	 andrewbogott: yay!
[23:28:42] <andrewbogott>	 For some reason the pdns template I used cranked the ttl values way up from the defaults.
[23:28:47] <andrewbogott>	 anyway, let me know if the issue persists.
[23:28:51] <tgr>	 YuviPanda: HTTPS connections end on the labs proxy, right?
[23:29:13] <tgr>	 does it set some header that I can use to figure out what the protocol was?
[23:29:29] <YuviPanda>	 tgr: yes, x-forwarded-proto
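A tool sitting behind the proxy might read that header roughly like this. A minimal sketch: the function name and the plain mapping of headers are assumptions made for illustration, and falling back to "http" when the header is absent is a choice of this example, not documented proxy behaviour.

```python
def forwarded_proto(headers, default="http"):
    """Return the protocol the client used to reach the proxy.

    `headers` is any mapping of header names to values. HTTP header names
    are case-insensitive, so the lookup normalizes them to lower case.
    """
    for name, value in headers.items():
        if name.lower() == "x-forwarded-proto":
            # If several proxies appended values ("https, http"), the
            # first entry is the one closest to the client.
            return value.split(",")[0].strip().lower()
    return default
```

For example, `forwarded_proto({"X-Forwarded-Proto": "https"})` returns `"https"`, and an empty mapping falls back to the default.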
[23:31:04] <Luke081515>	 andrewbogott: Can you look once again please? The API with action=edit does not work on beta
[23:31:52] <andrewbogott>	 Luke081515: I don’t actually have anything to do with beta, and also the place to ask is in #wikimedia-releng
[23:31:58] <andrewbogott>	 (I concede that that is not obvious)
[23:32:39] <Luke081515>	 ok, I will do, thanks
[23:36:35] <wikibugs>	 6Labs: Puppet run fails on tools-webgrid-lighttpd-1408 with "Failed to determined $::labsproject" - https://phabricator.wikimedia.org/T107350#1493440 (10Andrew) This is because  # facter -p Error: Cannot allocate memory - /usr/sbin/dmidecode 2>/dev/null  I don't know what that's about, though -- it certainly isn...
[23:38:28] <wikibugs>	 6Labs: Puppet run fails on tools-webgrid-lighttpd-1408 with "Failed to determined $::labsproject" - https://phabricator.wikimedia.org/T107350#1493451 (10Andrew) Although  # facter -p labsproject tools  works fine
[23:43:02] <andrewbogott>	 !log tools draining, rebooting tools-webgrid-lighttpd-1408
[23:43:08] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[23:44:33] <wikibugs>	 6Labs: Puppet run fails on tools-webgrid-lighttpd-1408 with "Failed to determined $::labsproject" - https://phabricator.wikimedia.org/T107350#1493487 (10Andrew) 5Open>3Resolved a:3Andrew A reboot seems to have cheered it up.
[23:48:56] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 50.00% of data above the critical threshold [0.0]
[23:58:57] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0]