[00:01:07] <Krenair>	 aha
[00:01:12] <Krenair>	 I have confirmed the presence of the same problem in labtest
[00:01:47] <madhuvishy>	 Krenair: you have access to designate right? if not i'm happy to paste the results here
[00:01:58] <Krenair>	 Not in real-labs
[00:02:01] <madhuvishy>	 aah
[00:02:01] <Krenair>	 Well
[00:02:07] <madhuvishy>	 https://www.irccloud.com/pastebin/9ZgMvsTl/
[00:02:20] <Krenair>	 I have the novaadmin pass as a production deployer
[00:02:44] <Krenair>	 which means I can hit the openstack APIs, just not the underlying databases
[00:02:50] <madhuvishy>	 alright
[00:03:37] <Krenair>	 So here's the extent of the problem in labtest
[00:04:40] <Krenair>	 https://phabricator.wikimedia.org/P4078
[00:05:01] <Krenair>	 (note the different domain_id)
[00:07:21] <Krenair>	 all those are deleted instances in labtest with existing ptr records
[00:07:54] <Krenair>	 I imagine prod-labs' one is many times larger given contintcloud
[00:07:58] <madhuvishy>	 right
[00:08:58] <Krenair>	 https://phabricator.wikimedia.org/diffusion/GSNF/browse/master/nova_fixed_multi/base.py
[00:09:09] <Krenair>	 this is the code that handles creation of the labs internal dns data
[00:09:27] <Krenair>	 and, theoretically, deletion
[00:13:55] <madhuvishy>	 Krenair: brb
[00:19:36] <madhuvishy>	 Krenair: ok back, looking at this code
[00:20:08] <Krenair>	 nothing stands out at me as wrong here
[00:20:36] <Krenair>	 Think I'm going to live hack it on labtest to be noisy about what it's up to
[00:21:41] <madhuvishy>	 sure, let me know what you see
[00:29:08] <Krenair>	 I created and deleted one and it was cleaned up fine..
[00:29:18] <CristianCantoro>	 i would like to know, out of curiosity, if I can install jekyll on a tool account 
[00:29:28] <Krenair>	 madhuvishy, ^
[00:30:31] <madhuvishy>	 Krenair: ha, clearly that doesn't always happen - just makes it harder to debug :|
[00:30:56] <Krenair>	 madhuvishy, select max(nova.instances.deleted_at) from records join recordsets on records.recordset_id = recordsets.id left join nova.instances on replace(nova.instances.uuid, '-', '') = records.managed_resource_id where records.domain_id = '8d114f3c815b466cbdd49b91f704ea60' and recordsets.name like '%.10.in-addr.arpa.' and recordsets.type = 'PTR' and nova.instances.deleted_at is not null;
[00:31:39] <madhuvishy>	 it's probably gonna take a while
[00:31:51] <madhuvishy>	 running
[00:32:24] <Krenair>	 a while? that's worrying
[00:32:48] <Krenair>	 any idea how many rows nova.instances has?
[00:33:27] <bd808>	 CristianCantoro: sure. `gem install jekyll --user-install` should work.
[00:33:56] <madhuvishy>	 Krenair: 236862
[00:34:11] <Krenair>	 that's not a small number of instances
[00:34:55] <madhuvishy>	 yup
[00:35:36] <bd808>	 CristianCantoro: but the real beauty of jekyll is that it builds static sites, so you technically wouldn't need it installed on a tool labs host. You could just build the static site locally and then upload the output to your tool's ~/public_html
[00:38:46] <shinken-wm>	 PROBLEM - Host tools-webgrid-lighttpd-1407 is DOWN: CRITICAL - Host Unreachable (10.68.17.251)
[00:40:58] <wm-bot>	 Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/राजा उप्रेती was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=851629 edit summary: 
[00:49:53] <Krenair>	 hey madhuvishy 
[00:50:00] <Krenair>	 my irc died
[00:50:02] <madhuvishy>	 Krenair: yup
[00:50:12] <madhuvishy>	 and my ssh client zoned out
[00:50:17] <madhuvishy>	 the query did not finish
[00:50:25] <madhuvishy>	 rerunning
[00:50:35] <madhuvishy>	 but it's taking a really long time
[00:52:23] <Krenair>	 the query is just supposed to find the time of the most recent occurrence of the issue
[00:52:28] <madhuvishy>	 yeah
[00:55:06] <Krenair>	 so I know whether we're just being affected by historical nonsense and need to do a one-time clean-up
[00:55:12] <Krenair>	 or if this is an ongoing issue
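A minimal sketch of what the query at 00:30:56 computes, written in Python over in-memory rows rather than the real designate and nova databases. The function name and the dictionary layout are assumptions made for illustration; only the join on the dash-stripped instance UUID, the PTR and `.10.in-addr.arpa.` filters, and the max(deleted_at) come from the query itself. The domain_id filter is left out for brevity.

```python
from datetime import datetime


def latest_leaked_ptr_deletion(records, instances):
    """Return the newest deleted_at among deleted instances that still own a PTR record.

    records:   iterable of dicts with 'type', 'name', 'managed_resource_id'
    instances: iterable of dicts with 'uuid', 'deleted_at' (None if not deleted)
    Both layouts are hypothetical stand-ins for the real tables.
    """
    # The query joins on the UUID with dashes removed, so do the same here.
    deleted_at_by_id = {
        inst["uuid"].replace("-", ""): inst["deleted_at"]
        for inst in instances
        if inst["deleted_at"] is not None
    }
    leaked = [
        deleted_at_by_id[rec["managed_resource_id"]]
        for rec in records
        if rec["type"] == "PTR"
        and rec["name"].endswith(".10.in-addr.arpa.")
        and rec["managed_resource_id"] in deleted_at_by_id
    ]
    # None means no deleted instance still has a reverse DNS record.
    return max(leaked) if leaked else None
```

A recent result suggests an ongoing problem; a result from months ago suggests a one-time clean-up would be enough.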
[01:08:53] <andrewbogott>	 madhuvishy: I think there's a 'deleted' flag
[01:09:08] <andrewbogott>	 oh, except I guess you want deleted instances in this case?
[01:09:43] <andrewbogott>	 ok, sorry, I'm not helping, I didn't read the backscroll thoroughly enough
[01:11:42] <andrewbogott>	 This is one of those problems that won't happen when I'm looking for it.  When I was at the last OpenStack conference the designate-sink devs said something about how their notification subscription (which should detect creation and deletion) turned out to be a dead end and they've totally replaced it now.
[01:11:52] <andrewbogott>	 So I have my hopes somewhat pinned on the idea that that's what's biting us too
[01:12:12] <andrewbogott>	 But I haven't dug in enough to know e.g. what the new solution is or when it was released.
[01:13:52] * andrewbogott -> AFK again
[01:16:05] <madhuvishy>	 Krenair: here you go
[01:16:09] <madhuvishy>	 https://www.irccloud.com/pastebin/raLWrInq/
[01:16:47] <Krenair>	 hm
[01:16:55] <Krenair>	 last one almost two weeks ago
[01:18:39] <madhuvishy>	 yeah doesn't seem too historical
[01:20:46] <Krenair>	 was hoping it'd be either several months ago or within the last 24 hours :/
[01:21:29] <madhuvishy>	 ha ha
[01:31:08] <Krenair>	 madhuvishy, how far back do the designate logs on labservices1001/1002 go?
[01:31:26] <madhuvishy>	 i have never been on that box - looking
[01:37:35] <madhuvishy>	 Krenair: 2016-08-30 on labservices1001
[01:38:21] <madhuvishy>	 2016-08-02 on 1002
[01:41:10] <Krenair>	 perhaps the logs from around the max(deleted_at) time could be found?
[01:43:52] <Krenair>	 like `zgrep "2016-09-08 22:46:" /var/log/designate/designate-sink*.gz`
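For hosts where zgrep is not convenient, roughly the same search can be sketched in Python. The function name is hypothetical, and unlike zgrep this matches a plain substring rather than a regular expression.

```python
import glob
import gzip


def grep_rotated_logs(pattern, needle):
    """Yield (path, line) for every line containing needle in gzip logs matching pattern."""
    for path in sorted(glob.glob(pattern)):
        # Text mode with errors="replace" so a stray non-UTF-8 byte does not abort the scan.
        with gzip.open(path, "rt", errors="replace") as handle:
            for line in handle:
                if needle in line:
                    yield path, line.rstrip("\n")
```

Usage would look like `for path, line in grep_rotated_logs("/var/log/designate/designate-sink*.gz", "2016-09-08 22:46:"): print(path, line)`.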
[01:49:39] <madhuvishy>	 Krenair: haven't found anything yet
[01:50:17] <Krenair>	 damn
[01:50:27] <madhuvishy>	 Krenair: some api logs - but from 17:46:31
[01:50:41] <Krenair>	 so a completely different time of day? not much use
[01:50:53] <madhuvishy>	 ya
[01:51:26] <Krenair>	 when I ran the equivalent in labtest it showed me that labtest's last dns record issue like this happened after a different plugin of ours (the one that updates the ldap host entries) couldn't connect to ldap
[01:56:40] <madhuvishy>	 i don't know what this is but all the designate-api logs for 2016-09-08 17:46:31 look like this
[01:56:50] <madhuvishy>	 https://www.irccloud.com/pastebin/7puSfmZk/
[01:57:58] <Krenair>	 we're mainly looking at designate-sink
[01:59:20] <madhuvishy>	 yeah that has nothing
[02:02:19] <Krenair>	 I'm out of ideas
[02:07:34] <madhuvishy>	 Krenair: yeah - let's call it a day - and look tomorrow
[03:23:16] <Debra>	 Hi.
[03:23:32] <Debra>	 Is https://dumps.wikimedia.org/enwiki/20160901/enwiki-20160901-pages-meta-current.xml.bz2 on tools-bastion-03?
[03:23:45] <Debra>	 We had /data/scratch/dumps at some point.
[03:23:49] <Debra>	 It seems to have gone missing, I guess.
[03:24:52] <Debra>	 I re-created /data/scratch/dumps/enwiki/ and I'm putting a dump in there.
[03:25:17] <yuvipanda>	 Debra: why are you putting dumps in there instead of using the dumps from /public/dumps?
[03:25:18] <Debra>	 I can put it somewhere else if it's really a problem, but it looks like scratch has plenty of free space.
[03:25:32] <Debra>	 Oh, is there a /public/?
[03:25:39] <yuvipanda>	 yup, has all the dumps
[03:25:58] <Debra>	 Ah, perfect, thanks!
[03:26:21] <yuvipanda>	 yw
[03:28:53] <wikibugs>	 Striker, Community-Tech-Tool-Labs, User-bd808: Create Wikitech/LDAP accounts via a new user friendly guided workflow - https://phabricator.wikimedia.org/T144710#2654873 (bd808)
[03:31:49] <Debra>	 I'm glad I asked in here.
[03:32:35] <Debra>	 I made some redirects: https://wikitech.wikimedia.org/wiki/Database_dumps && https://wikitech.wikimedia.org/wiki//public/dumps
[03:36:14] <Debra>	 /shared/mediawiki is also neat. I've been wondering about that as well.
[03:36:44] <yuvipanda>	 thank you very much, Debra 
[03:36:54] <yuvipanda>	 Debra: use '/data/project/shared' instead of /shared. the latter is deprecated
[03:36:59] <yuvipanda>	 (they point to the same thing)
[03:40:08] <Debra>	 Fair enough. I made the redirects point to the same section.
[07:38:12] <wikibugs>	 Labs, Labs-Infrastructure, DBA, Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2655239 (Marostegui) ``` root@db1069:/srv# find . -name user_groups.frm  | awk -F "." '{print $3}' | awk -F "/" '{print $1}' | uniq -c | sort -k2...
[08:09:32] <shinken-wm>	 PROBLEM - Puppet staleness on tools-checker-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [43200.0]
[09:20:58] <wikibugs>	 Labs, Operations: Good bug reports - https://phabricator.wikimedia.org/T146266#2655420 (Jishnugopim)
[12:41:57] <shinken-wm>	 PROBLEM - SSH on tools-webgrid-lighttpd-1210 is CRITICAL: Server answer
[13:04:32] <CristianCantoro>	 Hi, I am trying to install jekyll on my toollabs project (wscontest) 
[13:04:39] <CristianCantoro>	 but I get this error:
[13:04:39] <CristianCantoro>	 ERROR:  Error installing jekyll:
[13:04:40] <CristianCantoro>	 	jekyll requires Ruby version >= 2.0.0.
[13:05:29] <CristianCantoro>	 I tried to install rvm (https://rvm.io) but it needs some dependencies, it would be very convenient to have it
[13:39:50] <wikibugs>	 Labs, Tool-Labs: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212#2655885 (chasemp) p:Triage>Normal
[13:40:51] <wikibugs>	 Labs, Tool-Labs: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212#2653821 (chasemp) thanks @yuvipanda  @madhuvishy you'll probably have to sync w/ @krenair or @Andrew on some DNS leak cleanup here :)
[13:41:52] <Lokal_Profil>	 !log tools.heritage Reverting local changes to categorize_images.py (T146278). The CI mechanisms are there for a reason
[13:41:53] <stashbot>	 T146278: categorize_images.py crashing due to malformated log message - https://phabricator.wikimedia.org/T146278
[13:41:56] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL, Master
[13:45:59] <wikibugs>	 Labs, Tool-Labs: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212#2655916 (AlexMonk-WMF) We looked into it last night, but weren't able to find the cause. We do know that the last instance to leave a reverse DNS entry behind was deleted around 2016-09-08 22...
[13:58:43] <wikibugs>	 Labs, Labs-Infrastructure, Operations, Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2655948 (AlexMonk-WMF) >>! In T146212#2655916, @AlexMonk-WMF wrote: > We looked into it last night, but weren't able to find the cause. We...
[14:21:00] <grrrit-wm>	 (CR) Jean-Frédéric: [C: -1] "I have been able to test this locally (will update the ReadMe to outline how)." (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/310889 (https://phabricator.wikimedia.org/T144772) (owner: Lokal Profil)
[15:57:16] <grrrit-wm>	 (PS2) Lokal Profil: Add Georgia in Georgian to database [labs/tools/heritage] - https://gerrit.wikimedia.org/r/310889 (https://phabricator.wikimedia.org/T144772) 
[15:57:24] <grrrit-wm>	 (CR) Lokal Profil: Add Georgia in Georgian to database (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/310889 (https://phabricator.wikimedia.org/T144772) (owner: Lokal Profil)
[15:58:48] <grrrit-wm>	 (CR) Jean-Frédéric: [C: 2] Add Georgia in Georgian to database [labs/tools/heritage] - https://gerrit.wikimedia.org/r/310889 (https://phabricator.wikimedia.org/T144772) (owner: Lokal Profil)
[16:00:01] <grrrit-wm>	 (Merged) jenkins-bot: Add Georgia in Georgian to database [labs/tools/heritage] - https://gerrit.wikimedia.org/r/310889 (https://phabricator.wikimedia.org/T144772) (owner: Lokal Profil)
[16:02:00] <wikibugs>	 Labs, Tool-Labs: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212#2656327 (madhuvishy) @AlexMonk-WMF With the offsite coming up and needing to add exec nodes to handle load, do you think we can do a clean up now so we can pool these, and then look into the...
[16:16:14] <Krenair>	 madhuvishy, chasemp: I have a script that should clean this all up
[16:16:20] <Krenair>	 I could run it if it's okay with you
[16:16:57] <madhuvishy>	 Krenair: sure that would be great 
[16:16:59] <chasemp>	 Krenair: sounds good but let's wait for andrewbogott to be around just in case?
[16:17:14] <chasemp>	 things go wrong that is
[16:19:57] <madhuvishy>	 that sounds wise
[16:20:08] <madhuvishy>	 i'll be back in a few, walking to office
[16:20:22] <Krenair>	 madhuvishy, chasemp: it'll deal with existing entries, but won't prevent the same issue from recurring in future
[16:21:05] <Krenair>	 I'd like to add some extra logging to sink_nova_fixed_multi
[16:21:06] <chasemp>	 understood, def onboard for figuring this out but atm it's worth the bandaid I think
[16:21:38] <Krenair>	 (at a level that makes it actually log stuff, unless andrew can figure out how to fix that)
[16:25:54] <grrrit-wm>	 (CR) Lokal Profil: "> (1 comment)" (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/310889 (https://phabricator.wikimedia.org/T144772) (owner: Lokal Profil)
[16:28:35] <bd808>	 CristianCantoro: Your best bet for newer ruby support would be for us to get a ruby image built for the kubernetes webservices. We run debian jessie base images on kubernetes and that would provide Ruby 2.1.5
[16:29:47] <grrrit-wm>	 (CR) Lokal Profil: "> > (1 comment)" (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/310889 (https://phabricator.wikimedia.org/T144772) (owner: Lokal Profil)
[16:36:06] <andrewbogott>	 Krenair: I'm here… you were going to clean up leaked dns entries?
[16:36:51] <Krenair>	 yes
[16:38:38] <andrewbogott>	 that's a script you've run before, right?  Or is it something new?
[16:39:26] <wikibugs>	 Labs, Labs-Kubernetes, Tool-Labs: Add a easy way to run a ruby webservice on tools - https://phabricator.wikimedia.org/T141388#2496577 (bd808) @MusikAnimal Debian Jessie has Ruby 2.1.5 packaged. Would that be new enough to replace your use of rbenv? What Ruby version are you using?  I think Ruby coul...
[16:40:14] <Krenair>	 andrewbogott, something new
[16:40:22] <Krenair>	 well, ish
[16:40:43] <Krenair>	 I wrote it over a month ago
[16:41:03] <andrewbogott>	 ok — I'm happy for you to run it if you're confident.
[16:41:07] <Krenair>	 always ran it with the write calls commented out, so it just said what it *would* do
[16:41:26] <andrewbogott>	 ah, sure, if you've already done a dry run so you know what it'll do, that seems safe
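The dry-run approach described here (write calls disabled so the script only reports what it would do) can be sketched with a flag instead of commented-out code. This is an illustration under assumptions, not Krenair's script: `clean_leaked_records`, `delete_record`, and the record IDs are hypothetical names.

```python
def clean_leaked_records(leaked_ids, delete_record, dry_run=True):
    """Delete each leaked record, or only report the planned deletions when dry_run is set."""
    actions = []
    for record_id in leaked_ids:
        if dry_run:
            # No write call is made; just describe it.
            actions.append("would delete %s" % record_id)
        else:
            delete_record(record_id)
            actions.append("deleted %s" % record_id)
    return actions
```

Defaulting `dry_run` to True means a real deletion has to be requested explicitly, which keeps the reviewed dry-run output and the live run on the same code path.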
[16:46:59] <Krenair>	 andrewbogott, chasemp, madhuvishy: done
[16:48:58] <Krenair>	 output: https://phabricator.wikimedia.org/P4089
[16:50:49] <chasemp>	 that's a lot of cleanup
[16:51:01] <chasemp>	 we should link this paste to the main task for historical ref?
[16:51:27] <wikibugs>	 Labs, Labs-Kubernetes, Tool-Labs, Patch-For-Review: Add a easy way to run a ruby webservice on tools - https://phabricator.wikimedia.org/T141388#2656489 (MusikAnimal) @bd808 Yeah that should suffice. I'm OK with something other than Unicorn too, especially if it means it will take less than 5 min...
[16:52:56] <wikibugs>	 Labs, Labs-Infrastructure, Patch-For-Review: Clean up leaked designate entries - https://phabricator.wikimedia.org/T120797#2656505 (AlexMonk-WMF) {P4089}
[16:53:15] <shinken-wm>	 RECOVERY - Host tools-webgrid-lighttpd-1414 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[16:53:16] <wikibugs>	 Labs, Labs-Infrastructure, Patch-For-Review: Clean up leaked designate entries - https://phabricator.wikimedia.org/T120797#2656510 (AlexMonk-WMF) Open>Resolved a:AlexMonk-WMF
[16:53:37] <yuvipanda>	 !log fastcci enable backports on fastcci-puppetmaster
[16:53:40] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Fastcci/SAL, Master
[16:53:54] <chasemp>	 yuvipanda: what did you figure out w/ backports?  is it random where it's enable or a historical relic?
[16:54:16] <shinken-wm>	 RECOVERY - Host tools-webgrid-lighttpd-1407 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[16:54:22] <yuvipanda>	 chasemp: historical relic. we enabled it at some point, so older instances don't have it on
[16:54:27] <yuvipanda>	 (enabled in the image)
[16:54:39] <chasemp>	 thanks (was mainly curious)
[16:54:59] <yuvipanda>	 !log monitoring enable backports on filippo-test-trusty
[16:55:03] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Monitoring/SAL, Master
[16:55:16] <yuvipanda>	 !log graphite enable backports on graphite-labs
[16:55:20] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Graphite/SAL, Master
[16:55:32] <wikibugs>	 Labs, Labs-Infrastructure, Operations, Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2656525 (AlexMonk-WMF) The script was run against real-labs in T120797 and most existing problem cases should be gone now
[16:56:12] <yuvipanda>	 !log etcd enable backports on master
[16:56:15] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Etcd/SAL, Master
[16:56:57] <shinken-wm>	 RECOVERY - SSH on tools-webgrid-lighttpd-1210 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0)
[16:57:08] <yuvipanda>	 !log mediawiki-core-team enabled backports on shaved-yak
[16:57:12] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mediawiki-core-team/SAL, Master
[16:57:31] <chasemp>	 !log tools reboot tools-webgrid-lighttpd-1407, tools-webgrid-lighttpd-1210, tools-webgrid-lighttpd-1414, and then tools-webgrid-lighttpd-1405 as the first 3 return
[16:57:35] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[16:57:57] <yuvipanda>	 !log wikidata-query enable backports on wqds-puppetmaster
[16:58:00] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-query/SAL, Master
[16:58:12] <wikibugs>	 Labs, Tool-Labs: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212#2656533 (AlexMonk-WMF) Cleanup script has been run for existing cases. ```krenair@bastion-01:~$ host tools-webgrid-lighttpd-1418  tools-webgrid-lighttpd-1418.eqiad.wmflabs has address 10.68.2...
[16:58:52] <wikibugs>	 Labs, Labs-Infrastructure, Patch-For-Review: Clean up leaked designate entries - https://phabricator.wikimedia.org/T120797#1861262 (hashar) \O/
[16:59:13] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:59:30] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:00:40] <tom29739>	 My new labs instance is having problems with puppet: 2016-09-21T16:54:28.235782+00:00 captcha-ai-02 rc.local[406]: #033[1;31mError: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to determined $::labsproject at /etc/puppet/manifests/realm.pp:58 on node captcha-ai-02.privpol-captcha.eqiad.wmflabs#033[0m
[17:01:05] <tom29739>	 And this: 2016-09-21T16:56:02.871263+00:00 captcha-ai-02 puppet-agent[519]: Could not request certificate: getaddrinfo: Name or service not known
[17:01:24] <chasemp>	 tom29739: is this a project w/ an intra-project master?
[17:01:30] <tom29739>	 Nope.
[17:01:35] <tom29739>	 It uses the labs puppetmaster.
[17:01:50] <chasemp>	 well that's super interesting
[17:02:26] <tom29739>	 I recall a couple of my other instances having that same problem ages ago.
[17:02:34] <chasemp>	 andrewbogott: ^ any reason project lookup would be failing here?  seems like the race condition we know about but in a new place
[17:02:43] <tom29739>	 But those instances had reused names, whereas this one doesn't
[17:03:04] <chasemp>	 tom29739: is it ssh-able?
[17:03:42] <tom29739>	 It asks for a password.
[17:04:20] <andrewbogott>	 I'll look
[17:08:01] <shinken-wm>	 RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [3600.0]
[17:08:21] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:08:46] <tom29739>	 Interestingly, it doesn't seem to have sent any Graphite data: https://tools.wmflabs.org/nagf/?project=privpol-captcha#h_captcha-ai-02_cpu
[17:09:39] <shinken-wm>	 RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [3600.0]
[17:10:27] <shinken-wm>	 RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [3600.0]
[17:15:07] <bd808>	 tom29739: if it's asking for a password it probably broke in the middle of the first puppet run and is generally busted
[17:15:14] <andrewbogott>	 chasemp, tom29739, this looks like the same hostname failure as usual… tom29739, do you mind just rebuilding?  Or is this happening consistently?
[17:16:41] <chasemp>	 andrewbogott: my main thought was extra bonus points for not happening in a project w/ its own master
[17:16:52] <chasemp>	 we'll have to dig into this post offsite, man, this is like 6 times in a week or so
[17:16:56] <chasemp>	 (that I know of)
[17:26:19] <bd808>	 !log tools.admin Removed Mark Bergsma, Ryan Lane, and Coren from maintainers list. If any of them want back on I'll be glad to re-add them.
[17:26:23] <bd808>	 mark: ^
[17:26:23] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL, Master
[17:31:11] <tom29739>	 andrewbogott, the new new instance seems to be working. Thanks :)
[18:42:45] <madhuvishy>	 !log tools Repooled tools-webgrid-lighttpd-1416 (T146212) after dns records cleanup
[18:42:47] <stashbot>	 T146212: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212
[18:42:50] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[18:43:21] <chasemp>	 madhuvishy: can you run a few tests on those new nodes to validate them? :) 
[18:43:43] <madhuvishy>	 chasemp: yeah - the repool script ran a test - it went okay
[18:43:46] <chasemp>	 sweet
[18:43:47] <madhuvishy>	 i'm watching it to see
[18:43:59] <chasemp>	 I don't know that the test thing would have caught the errors before?
[18:44:02] <chasemp>	 main reason I asked
[18:44:09] <chasemp>	 it was work type specific I think
[18:44:17] <madhuvishy>	 yeah - i think the issues happened when webservices start
[18:45:57] <madhuvishy>	 but it should be okay - there are no duplicate records in designate
[18:47:14] <madhuvishy>	 tools.jembot is scheduled there
[18:56:42] <madhuvishy>	 !log tools Repooled tools-webgrid-lighttpd-1418 (T146212) after dns records cleanup
[18:56:43] <stashbot>	 T146212: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212
[18:56:46] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[18:57:52] <wikibugs>	 Labs-Kubernetes, Community-Tech, Wikisource: Make Google OCR API on Tool Labs work under Kubernetes - https://phabricator.wikimedia.org/T146311#2657046 (kaldari)
[19:37:20] <shinken-wm>	 PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:03:54] <wikibugs>	 Labs, Tool-Labs: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212#2657258 (madhuvishy) Open>Resolved This is all done - tools-webgrid-lighttpd-1415, 1416, 1418 are up and running.
[20:27:21] <shinken-wm>	 RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[20:29:57] <shinken-wm>	 PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[20:32:10] <wikibugs>	 Labs, Tracking: Increase quota for tools project - https://phabricator.wikimedia.org/T146322#2657383 (madhuvishy)
[20:47:28] <shinken-wm>	 RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[20:49:05] <wikibugs>	 Labs, Labs-Infrastructure: Intermittent ldap failures in designate-sink - https://phabricator.wikimedia.org/T146325#2657460 (Andrew)
[20:51:04] <shinken-wm>	 PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[20:51:22] <wikibugs>	 Labs, Labs-Infrastructure: Intermittent ldap failures in designate-sink - https://phabricator.wikimedia.org/T146325#2657487 (Andrew) It looks to me like my code is racing to complete a transaction before the connection is closed on the server end.  I know that we have something automatically cleaning up...
[20:53:10] <shinken-wm>	 RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[20:56:08] <shinken-wm>	 PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[21:26:26] <wikibugs>	 Labs, Labs-Infrastructure: Intermittent ldap failures in designate-sink - https://phabricator.wikimedia.org/T146325#2657596 (Andrew) OK, now I think that these designate_sink plugins are being run in a multi-threaded environment.  I'm going to rewrite this one to be thread safe and see if that makes the...
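One common way to make a plugin that holds a shared connection safe under multiple threads is to give each thread its own connection. The sketch below illustrates that general idea only; it is not the change Andrew made, and the class name and the `connect` callable are hypothetical.

```python
import threading


class PerThreadConnection:
    """Hand out one lazily created connection per thread (illustrative only)."""

    def __init__(self, connect):
        # connect is any zero-argument callable returning a new connection object.
        self._connect = connect
        self._local = threading.local()

    def get(self):
        conn = getattr(self._local, "conn", None)
        if conn is None:
            # First use in this thread: open a connection no other thread will touch.
            conn = self._connect()
            self._local.conn = conn
        return conn
```

Because no connection object is ever shared, one thread cannot close or reuse a connection while another is in the middle of a transaction on it.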
[21:38:09] <wikibugs>	 Labs-Kubernetes, Community-Tech, Wikisource: Make Google OCR API on Tool Labs work under Kubernetes - https://phabricator.wikimedia.org/T146311#2657640 (DannyH) p:Triage>Normal
[21:41:05] <wikibugs>	 Labs, Tool-Labs, Community-Tech-Tool-Labs, Developer-Relations, WMF-Legal: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#2657643 (bd808)
[21:47:30] <wikibugs>	 Labs, Tool-Labs, Community-Tech-Tool-Labs, Developer-Relations, and 4 others: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#2657656 (bd808) a:bd808 I have drafted two straw dog proposals and am proposing that they be discussed and refined...
[22:13:08] * bd808 spams everyone with https://meta.wikimedia.org/wiki/Requests_for_comment/Abandoned_Labs_tools links
[22:15:33] <tom29739>	 bd808: I've already read through it :P
[22:17:00] <bd808>	 tom29739: one part I would love to have some help figuring out: how and where do we call a consensus vote for tool labs?
[22:17:41] <tom29739>	 That's a good question
[22:17:48] <bd808>	 I did a vote on phab previously, but legoktm and others said they didn't like that
[22:18:06] <tom29739>	 It's clear that the RFC didn't work very well
[22:18:25] <tom29739>	 Maybe just something simple
[22:18:26] <bd808>	 yeah. it didn't get much participation
[22:18:41] <tom29739>	 Central notice maybe?
[22:19:01] <tom29739>	 Put it in the tool labs SSH notice?
[22:19:09] <bd808>	 One thing that seems to work on wikitech-l is posting that something is a done deal and then waiting a week for people to freak out :)
[22:20:44] <bd808>	 I think one of the problems with reaching the tool maintainers is that they really are on their home wikis and not irc/wikitech/mailing lists
[22:21:32] <bd808>	 there are certainly a few folks who are active here on irc but numerically not a very high percentage
[22:21:48] <bd808>	 and meta is ... not my favorite place to hang out
[22:23:52] <tom29739>	 Meta isn't the best place
[22:25:17] <tom29739>	 bd808, how have consensus votes been done in the past?
[22:25:52] <bd808>	 I'm honestly not sure it's ever really been done. The only one I know of was the slowvote I did on Phab
[22:26:23] <bd808>	 That was https://phabricator.wikimedia.org/V7
[22:30:00] <bd808>	 tom29739: I may just survey some of the existing on-wiki voting systems and arbitrarily propose one for wikitech. We need a community :/
[22:31:17] <tom29739>	 That's the trouble I find with Labs as a whole.
[22:31:27] <tom29739>	 The community doesn't live on wikitech.
[22:31:51] <tom29739>	 It lives elsewhere, like on the maintainers' home wikis
[22:32:53] <tom29739>	 Which presents a problem in some cases, because if the maintainer is German say, and puts his documentation on the German wiki..
[22:32:54] <tom29739>	 I can't speak German.
[22:33:20] <tom29739>	 Something like SecureVote would probably work
[22:33:31] <tom29739>	 Just changing it to make it insecure
[22:40:22] <legoktm>	 tom29739: votes aren't a good way to decide things.
[22:41:29] <tom29739>	 I kinda disagree with voting myself
[22:42:11] <tom29739>	 This is good: Editors might miss the best solution (or the best compromise) because it wasn't one of the options. This is especially problematic when there are complex or multiple issues involved. Establishing consensus requires expressing that opinion in terms other than a choice between discrete options, and expanding the reasoning behind it, addressing
[22:42:11] <tom29739>	 the points that others have left, until all come to a mutually agreeable solution.
[22:45:04] <Platonides>	 a vote is not the proper way for something so open as this
[22:45:08] <wikibugs>	 Labs, Labs-Infrastructure: Intermittent ldap failures in designate-sink - https://phabricator.wikimedia.org/T146325#2657837 (Andrew) https://gerrit.wikimedia.org/r/#/c/312127/
[22:56:22] <bd808>	 I think mostly what I'm looking for is a small number of "omg this is the worst" responses honestly
[22:56:57] <bd808>	 like if 85% of people who actually do respond don't hate the whole thing then it's probably fine
[22:57:33] <bd808>	 But I really have never figured out how consensus is gauged
[23:12:27] <wikibugs>	 Labs, Labs-Infrastructure: Intermittent ldap failures in designate-sink - https://phabricator.wikimedia.org/T146325#2657888 (Andrew) Open>Resolved Seems happy now!
[23:43:26] <wikibugs>	 Labs: certcleaner.py uses ldap instance records.  Should instead  talk to keystone and nova - https://phabricator.wikimedia.org/T146303#2657921 (AlexMonk-WMF) This isn't run by labs instances, is it? Just the main puppetmaster on labcontrol?