[00:35:40] <shinken-wm>	 PROBLEM - Free space - all mounts on tools-worker-1019 is CRITICAL: CRITICAL: tools.tools-worker-1019.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found)tools.tools-worker-1019.diskspace.root.byte_percentfree (<40.00%)
[00:39:03] <yuvipanda>	 ori: ah, the fun problems of gridengine :)
[00:39:48] <bd808>	 !log tools.morebots All bots shutdown. Stashbot is now handling !log messages in #countervandalism, #wikimedia-analytics, #wikimedia-fundraising, #wikimedia-labs, #wikimedia-operations, and #wikimedia-releng
[00:39:49] <stashbot>	 Unknown project "tools.morebots"
[00:40:06] <bd808>	 uhh...
[00:40:07] <yuvipanda>	 bd808: \o/
[00:40:31] <bd808>	 what's with the unknown project message?
[00:40:45] <bd808>	 !log tools.stashbot now handling !log messages in #countervandalism, #wikimedia-analytics, #wikimedia-fundraising, #wikimedia-labs, #wikimedia-operations, and #wikimedia-releng
[00:40:45] <stashbot>	 Unknown project "tools.stashbot"
[00:40:48] <bd808>	 frack
[00:41:07] <bd808>	 !log striker Test
[00:41:07] <stashbot>	 Unknown project "striker"
[00:41:33] <bd808>	 ldap barf
[00:41:46] <godog>	 sad_trombone.wav
[00:41:48] <bd808>	 Exception getting LDAP data for ou=projects,dc=wikimedia,dc=org
[00:42:27] <bd808>	 At least that doesn't crash the bot anymore ;)
[00:43:41] <bd808>	 the first error was "LDAPSessionTerminatedByServerError: session terminated by server"
[00:44:29] <bd808>	 I wonder if ldap3 can recover from that or if I need to explicitly create a new connection?
[00:51:12] <Krenair>	 I thought we didn't use ou=projects,dc=wikimedia,dc=org anymore
[00:57:28] <bd808>	 Krenair: that's actually quite possible. I cribbed the LDAP code from admin bot quite a long time ago
[00:57:45] <bd808>	 !log tools.stashbot Is LDAP happy yet?
[00:57:46] <stashbot>	 Unknown project "tools.stashbot"
[00:57:50] <bd808>	 frack
[00:58:02] <Krenair>	 we have some task, hang on
[00:58:14] <Krenair>	 here: https://phabricator.wikimedia.org/T138150
[00:59:07] <bd808>	 right. I've looked at that recently because of the puppet cron spam complaint
[00:59:35] <bd808>	 so yeah I guess I need to change how the project name validation here is done and try to make the ldap connection self-healing
[01:05:42] <shinken-wm>	 RECOVERY - Free space - all mounts on tools-worker-1019 is OK: OK: tools.tools-worker-1019.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found)
[01:17:36] <bd808>	 !log tools.stashbot Is LDAP happy yet?
[01:17:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[01:17:51] <bd808>	 !log tools.stashbot now handling !log messages in #countervandalism, #wikimedia-analytics, #wikimedia-fundraising, #wikimedia-labs, #wikimedia-operations, and #wikimedia-releng
[01:17:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[01:18:05] <bd808>	 !log tools.morebots All bots shutdown. Stashbot is now handling !log messages in #countervandalism, #wikimedia-analytics, #wikimedia-fundraising, #wikimedia-labs, #wikimedia-operations, and #wikimedia-releng
[01:18:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.morebots/SAL
[01:25:35] <bd808>	 yuvipanda: it's also 2 fewer jobs running on precise :)
[01:25:45] <yuvipanda>	 \o/
[01:25:50] <yuvipanda>	 and another pod on k8s :D
[01:26:11] <yuvipanda>	 bd808: you can accurately track your pod's memory and CPU usage now
[01:26:35] <bd808>	 I hope its low
[01:26:50] <bd808>	 its a stream based message processor after all
[01:27:29] <yuvipanda>	 bd808: https://grafana-labs.wikimedia.org/dashboard/db/kubernetes-tool-combined-stats?var-namespace=stashbot
[01:28:02] <yuvipanda>	 I need to spend time on these dashboards too
[01:28:04] <bd808>	 I like the trend of this graph -- https://graphite-labs.wikimedia.org/render/?width=600&height=300&target=cactiStyle(sumSeries(tools.tools-services-01.sge.hosts.tools*12*.job_count))&target=cactiStyle(sumSeries(tools.tools-services-01.sge.hosts.tools*14*.job_count))&from=-3d
[01:29:08] <yuvipanda>	 nice
[02:10:03] <wikibugs>	 10Labs-project-Wikistats, 13Patch-For-Review: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2750946 (10Dzahn) >>! In T149000#2748917, @hashar wrote: > Most probably because we list http and they have switched to https with the script not following redirects? :]  That is exactly right, the...
[02:11:21] <wikibugs>	 10Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2750950 (10Dzahn)
[02:12:56] <bd808>	 Krenair: for the project entires in LDAP, it sure looks to me like OSM creates cn=$NAME,ou=projects,dc=wikimedia,dc=org
[02:18:35] <bd808>	 the roles could be managed differently. That part is delegated to Nova in the OSM code
[02:20:38] <wikibugs>	 10Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2750954 (10Dzahn) merged, deployed, ran update script  table looks much better now: http://wikistats.wmflabs.org/display.php?t=w3  some of them became 404 meanwhile, and maybe there are new ones, dont know  I tried usin...
[02:20:45] <wikibugs>	 10Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2750956 (10Dzahn) 05Open>03Resolved
[02:23:10] <wikibugs>	 10Labs-project-Wikistats: LXDE wiki updates broken - https://phabricator.wikimedia.org/T149396#2750958 (10Dzahn)
[02:23:34] <wikibugs>	 10Labs-project-Wikistats: LXDE wiki updates broken - https://phabricator.wikimedia.org/T149396#2750973 (10Dzahn)
[05:18:03] <wikibugs>	 06Labs, 10Adminbot: Get a cloak for morebots & labs-morebots - https://phabricator.wikimedia.org/T140547#2468752 (10bd808) Stashbot has taken over the `!log` duties in all channels for morebots and it does have a cloak.
[05:18:19] <wikibugs>	 06Labs, 10Adminbot: Get a cloak for morebots & labs-morebots - https://phabricator.wikimedia.org/T140547#2751110 (10bd808) p:05Normal>03Low
[05:52:09] <wikibugs>	 10Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2751134 (10RobiH) List of W3C Communities: https://www.w3.org/community/groups/ Add all that did not exist.  Half of them are not wikis.  By error Code, they can be deleted afterwards.
[05:55:41] <wikibugs>	 10Labs-project-Wikistats: LXDE wiki updates broken - https://phabricator.wikimedia.org/T149396#2751135 (10RobiH) Same here: Change http://wiki.lxde.org/en/api.php?action=query&meta=siteinfo&siprop=statistics&maxlag=5 into https://wiki.lxde.org/en/api.php?action=query&meta=siteinfo&siprop=statistics&maxlag=5&form...
[06:34:11] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1221 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[07:14:11] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1221 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:40:28] <grrrit-wm>	 (03PS2) 10Platonides: Add an "authenticate" command for identifying with nickserv after connection [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318229 (https://phabricator.wikimedia.org/T149265) 
[07:40:55] <grrrit-wm>	 (03CR) 10MarcoAurelio: "Glaisher, mind having a look at this? Thanks." [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318229 (https://phabricator.wikimedia.org/T149265) (owner: 10Platonides)
[07:41:29] <wikibugs>	 10Tool-Labs-tools-stewardbots, 13Patch-For-Review: StewardBot not logged into irc - https://phabricator.wikimedia.org/T149265#2751200 (10MarcoAurelio) p:05Triage>03Low
[07:53:02] <Wesalius>	 how come quickstatements are not listed at http://tools.wmflabs.org/?status
[07:53:05] <Wesalius>	 ?
[07:53:22] <Wesalius>	 they are hosted at tools: http://tools.wmflabs.org/wikidata-todo/quick_statements.php
[07:55:42] <abbe98[m]>	 It's hosted at wikidata-todo/ which is listed
[08:02:51] <wikibugs>	 10Tool-Labs-tools-stewardbots: Evaluate cleanup on StewardBot's code - https://phabricator.wikimedia.org/T149404#2751215 (10MarcoAurelio)
[08:02:55] <wikibugs>	 10Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2751228 (10hashar) @Dzahn apparently the wikistat script has support to fix such URLs!  Reading `usr/lib/wikistats/update.php` one might be able to just:      update.php w3 fixit  That get from the database all the url...
[08:05:01] <wikibugs>	 10Tool-Labs-tools-stewardbots: General update of HTML and CSS for stewardbot tools and portals - https://phabricator.wikimedia.org/T128745#2751233 (10MarcoAurelio) 05Open>03Resolved a:03MarcoAurelio Mainly done. I'll try to fix other minor issues when they appear.
[08:05:04] <wikibugs>	 06Labs, 10Tool-Labs: Improve outcome of http://tools.wmflabs.org/?status proposal - https://phabricator.wikimedia.org/T149405#2751237 (10Wesalius)
[08:05:34] <Wesalius>	 abbe98[m]: thanks
[08:06:49] <wikibugs>	 10Tool-Labs-tools-stewardbots, 10Continuous-Integration-Config, 13Patch-For-Review: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#2751251 (10MarcoAurelio) Okay, so, to summarize, we have PHP lint tests running. We still lack (IMO):  * python checks * and maybe...
[08:07:04] <wikibugs>	 10Tool-Labs-tools-stewardbots, 10Continuous-Integration-Config: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#2751252 (10MarcoAurelio)
[09:34:29] <doctaxon>	 Hi! Running a script on the labs instance dwl I get the error '35 SSL connect error. The SSL handshaking failed.' so about one time the hour. What is the reason why?
[09:41:57] <doctaxon>	 jynus: can you say something to this ^
[09:47:09] <doctaxon>	 valhallasw`vecto: Hi! Running a script on the labs instance dwl I get the error '35 SSL connect error. The SSL handshaking failed.' so about one time the hour. What is the reason why?
[09:49:24] <valhallasw`vecto>	 Doctaxon: connecting to where? Does curl/wger work? Is it just from that host or from all labs hosts or also from home, etc....
[09:50:42] <doctaxon>	 connecting to where? Ahm, API
[09:51:03] <doctaxon>	 sorry, i am not very familiar with that
[10:00:03] <doctaxon>	 valhallasw`vecto: can you tell me first, what an error that is? SSL handshaking?
[10:09:27] <doctaxon>	 valhallasw`vecto: is it possible that the session key gets lost or changes during data transfer between client and server?
[10:18:18] <Nemo_bis>	 >  Puppet is failing to run on the "dumps-stats.dumps.eqiad.wmflabs" instance in Wikimedia Labs.
[10:22:14] <grrrit-wm>	 (03PS1) 10Gehel: fixed default configuration for maps / postgresql [labs/private] - 10https://gerrit.wikimedia.org/r/318516 
[10:35:29] <wikibugs>	 10Tool-Labs-tools-stewardbots, 10Continuous-Integration-Config: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#2751559 (10hashar) The CI definition in integration/config.git zuul/layout.yaml is: ```   - name: labs/tools/stewardbots     template:       - name: comp...
[10:39:27] <grrrit-wm>	 (03CR) 10Gehel: [C: 032 V: 032] fixed default configuration for maps / postgresql [labs/private] - 10https://gerrit.wikimedia.org/r/318516 (owner: 10Gehel)
[10:41:06] <valhallasw`vecto>	 Doctaxon: i don't know? Something with ssl not being able to set up the connection
[10:41:51] <valhallasw`vecto>	 So either something with your vm, something with the network or something with the server you're connecting to...
[10:42:06] <doctaxon>	 the connection is set up, but it breaks after an hour or two
[10:42:42] <doctaxon>	 i am connecting to the api servers
[10:42:52] <valhallasw`vecto>	 Handshake is the initial phase of the connection
[10:43:13] <valhallasw`vecto>	 So it's not set up (anymore?) at that point
[10:43:34] <doctaxon>	 yes, but I get the error during the connection, it breaks with this error report
[10:44:24] <valhallasw`vecto>	 So apparently you're reconnecting and that fails?
[10:44:46] <doctaxon>	 making an api interrogation
[10:47:33] <doctaxon>	 the scripts makes some api interrogations to get some data, and then immediately the error 35 occurs and the script breaks
[10:57:35] <valhallasw`vecto>	 So debug what your http stack is doing...?
[11:36:31] <grrrit-wm>	 (03PS1) 10Hashar: Introduce tox + flake8 [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318521 (https://phabricator.wikimedia.org/T128503) 
[11:36:52] <grrrit-wm>	 (03CR) 10Hashar: "check experimental" [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318521 (https://phabricator.wikimedia.org/T128503) (owner: 10Hashar)
[11:37:59] <wikibugs>	 10Tool-Labs-tools-stewardbots, 10Continuous-Integration-Config, 13Patch-For-Review: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#2751642 (10hashar) 05stalled>03Open https://gerrit.wikimedia.org/r/318521 is a first pass and should be a good base to build up...
[12:45:55] <shinken-wm>	 RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[12:53:36] <shinken-wm>	 PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[13:42:52] <wikibugs>	 06Labs, 10Labs-Infrastructure, 10DBA, 13Patch-For-Review: Initial setup and provision of labsdb1009, labsdb1010 and labsdb1011 - https://phabricator.wikimedia.org/T140452#2751913 (10jcrespo)
[13:43:44] <wikibugs>	 06Labs, 10Labs-Infrastructure, 10DBA, 13Patch-For-Review: Initial setup and provision of labsdb1009, labsdb1010 and labsdb1011 - https://phabricator.wikimedia.org/T140452#2465396 (10jcrespo)
[13:43:47] <wikibugs>	 06Labs, 10Labs-Infrastructure, 10DBA: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T142807#2751915 (10jcrespo)
[13:44:34] <wikibugs>	 06Labs, 10Labs-Infrastructure, 10DBA, 07Epic, 07Tracking: Labs databases rearchitecture (tracking) - https://phabricator.wikimedia.org/T140788#2751919 (10jcrespo)
[13:44:37] <wikibugs>	 06Labs, 10Labs-Infrastructure, 10DBA: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T142807#2546917 (10jcrespo)
[14:13:55] <wikibugs>	 06Labs, 10Horizon, 13Patch-For-Review: Add a 'remember me' feature to Horizon - https://phabricator.wikimedia.org/T149036#2752010 (10Andrew) 05Open>03Resolved
[14:14:25] <shinken-wm>	 RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 200.52 ms
[14:17:18] <wikibugs>	 06Labs, 10Labs-Infrastructure, 10DBA: Provision with data the new labsdb servers and provide replica service with at least 1 shard from a sanitized copy from production - https://phabricator.wikimedia.org/T147052#2752030 (10jcrespo)
[14:17:43] <shinken-wm>	 PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[14:26:12] <shinken-wm>	 RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[14:28:30] <shinken-wm>	 PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[14:34:37] <wikibugs>	 06Labs: Request increased quota for services-test labs project - https://phabricator.wikimedia.org/T148869#2752062 (10Eevans) >>! In T148869#2738712, @Andrew wrote: > OK, quotas are increased.  Please make a note on this ticket when you're done debugging and I'll move the quotas back to their previous values: >...
[14:43:10] <wm-bot>	 Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Reinhard Kraasch was created, changed by Reinhard Kraasch link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Reinhard_Kraasch edit summary: Created page with "{{Tools Access Request |Justification=I want to migrate some of my bot tasks to tool labs   ([[:de:User:RKBot]] and [[:commons:User:RKBot]]) |Completed=false |User Name=Reinha..."
[15:06:27] <wm-bot>	 Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Reinhard Kraasch was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=936626 edit summary: 
[15:23:25] <shinken-wm>	 RECOVERY - Free space - all mounts on tools-docker-builder-01 is OK: OK: tools.tools-docker-builder-01.diskspace.root.byte_percentfree (More than half of the datapoints are undefined) tools.tools-docker-builder-01.diskspace._srv.byte_percentfree (More than half of the datapoints are undefined)
[15:23:36] <wikibugs>	 06Labs, 06Operations, 07Tracking: Sync data for tools-project from labstore1001 to labstore1004/5 - https://phabricator.wikimedia.org/T144255#2752201 (10chasemp) @madhuvishy  thoughts on truncating the disposal >10G files and kicking off an update of rsync over the weekend w/ the tree largest excluded for no...
[15:25:07] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:25:08] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:25:16] <ori>	 name resolution is not working on tools, in case it's not known
[15:25:17] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:25:17] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:25:18] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1203 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[15:25:18] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:26:30] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1420 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:26:30] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0]
[15:26:31] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [0.0]
[15:26:39] <shinken-wm>	 PROBLEM - Puppet run on tools-docker-registry-01 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [0.0]
[15:26:41] <shinken-wm>	 PROBLEM - Puppet run on tools-docker-builder-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:26:47] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0]
[15:26:48] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1405 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[15:26:49] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[15:26:51] <shinken-wm>	 PROBLEM - Puppet run on tools-proxy-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[15:26:52] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1205 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[15:26:55] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1204 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0]
[15:26:57] <shinken-wm>	 PROBLEM - Puppet run on tools-worker-1005 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[15:27:00] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:27:00] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:27:00] <shinken-wm>	 PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:27:02] <shinken-wm>	 PROBLEM - Puppet run on tools-worker-1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:27:06] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0]
[15:27:06] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:27:06] <shinken-wm>	 PROBLEM - Puppet run on tools-elastic-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:27:07] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[15:27:11] <bd808>	 ori: I think everything is expected to be messed up while the reboots are ongoing
[15:27:12] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[15:27:14] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:27:16] <shinken-wm>	 PROBLEM - Host ToolLabs is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org
[15:27:36] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[15:27:38] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1218 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[15:28:37] <chasemp>	 madhuvishy: can you change topic here for maint?
[15:28:39] <chasemp>	 I can't seem to tod it
[15:28:45] <chasemp>	 to do it even
[15:28:53] <bd808>	 bah. need to +o
[15:29:36] <niko>	 sorry, but shinken-wm missed the internal-server-nat.wmflabs.org hostmask, and 208.80.155.255 instead, so Sigyn was triggered
[15:30:23] <shinken-wm>	 PROBLEM - Puppet run on tools-worker-1022 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[15:30:35] <ori>	 thanks madhuvishy
[15:30:48] <shinken-wm>	 PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:31:14] <shinken-wm>	 PROBLEM - Puppet run on tools-worker-1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:31:15] <shinken-wm>	 PROBLEM - Puppet run on tools-worker-1019 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:31:17] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[15:31:23] <shinken-wm>	 PROBLEM - Puppet run on tools-logs-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[15:31:25] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[15:31:25] <shinken-wm>	 PROBLEM - Puppet run on tools-static-10 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]
[15:31:35] <shinken-wm>	 PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:31:37] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1419 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[15:31:39] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[15:31:41] <chasemp>	 ori: in theory everything is ok now, tho we aren't out of maint period as defined yet just fyi
[15:31:41] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[15:31:42] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[15:32:14] <shinken-wm>	 PROBLEM - Puppet run on tools-worker-1015 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:32:22] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1202 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[15:32:29] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1211 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:32:29] <shinken-wm>	 PROBLEM - Puppet run on tools-worker-1014 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[15:32:37] <shinken-wm>	 PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[15:32:37] <shinken-wm>	 PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[15:32:40] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1213 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[15:32:49] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1405 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:33:04] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:33:04] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1413 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[15:35:44] <shinken-wm>	 RECOVERY - Host ToolLabs is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[15:36:48] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1416 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:37:12] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1221 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[15:37:20] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1418 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[15:37:23] <bd808>	 I apparently don't have enough rights to remove the f'ing topiclock
[15:38:32] <Krenair>	 like that?
[15:38:35] <Krenair>	 bd808
[15:38:40] <bd808>	 :) thanks
[15:38:46] <Krenair>	 you're already opped
[15:38:52] * bd808 has weak irc fu
[15:39:02] <Krenair>	 so it should just be a case of /mode #wikimedia-labs -t
[15:39:24] <bd808>	 I was trying "set #wikimedia-labs topiclock off" with chanserv
[15:39:33] <bd808>	 I guess +t is not the same thing
[15:40:21] <Krenair>	 I don't think topiclock is on in this channel?
[15:40:39] <bd808>	 *nod* just me being confused
[15:41:17] <Krenair>	 hmm. what prompted you to look at topiclock stuff?
[15:41:41] <bd808>	 the fact that only ops could change the status
[15:41:55] <bd808>	 like I said, weak irc fu
[15:42:19] <Krenair>	 oh yeah, that's controlled by +t
[15:42:30] <Krenair>	 ok
[15:43:15] <shinken-wm>	 RECOVERY - Puppet run on tools-flannel-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:43:37] <chasemp>	 !log tools restart toolschecker service on 01 and 02
[15:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:45:43] <shinken-wm>	 RECOVERY - Puppet run on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:45:47] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1414 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:46:01] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1417 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:47:09] <shinken-wm>	 RECOVERY - Puppet run on tools-prometheus-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:47:13] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:47:31] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1023 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:47:46] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:47:54] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:48:24] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1206 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:51:10] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:51:16] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:51:20] <shinken-wm>	 RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:51:34] <shinken-wm>	 RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:51:50] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:52:01] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:52:43] <shinken-wm>	 RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:53:21] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:05] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:07] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:19] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:21] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:27] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:27] <shinken-wm>	 RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:29] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:35] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:41] <shinken-wm>	 RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:49] <shinken-wm>	 RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:51] <shinken-wm>	 RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:11] <shinken-wm>	 RECOVERY - Puppet run on tools-grid-master is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:26] <shinken-wm>	 RECOVERY - Puppet run on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:28] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1420 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:38] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:42] <shinken-wm>	 RECOVERY - Puppet run on tools-docker-builder-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:48] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:50] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:52] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:54] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1006 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:58] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:04] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1020 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:06] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:58:32] <Yuvi[m]>	 !log tools restart k8s master, seems to have run out of fds
[15:58:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:00:20] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:00:21] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1008 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:00:26] <shinken-wm>	 RECOVERY - Puppet run on tools-cron-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:00:31] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:00:37] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1418 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:00:39] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1021 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:00:49] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1415 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:00:49] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1010 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:01:13] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:01:17] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:01:39] <shinken-wm>	 RECOVERY - Puppet run on tools-docker-registry-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:01:41] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:01:48] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:01:57] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:01:59] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:01] <shinken-wm>	 RECOVERY - Puppet run on tools-k8s-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:03] <shinken-wm>	 RECOVERY - Puppet run on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:03] <shinken-wm>	 RECOVERY - Puppet run on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:13] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:16] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:24] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1022 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:32] <shinken-wm>	 RECOVERY - Puppet run on tools-redis-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:58] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:10] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:14] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1004 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:16] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1019 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:16] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1415 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:17] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:24] <shinken-wm>	 RECOVERY - Puppet run on tools-logs-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:28] <shinken-wm>	 RECOVERY - Puppet run on tools-static-10 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:29] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:30] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:46] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:49] <shinken-wm>	 RECOVERY - Puppet run on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:51] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:51] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:51] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:55] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:07:05] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:07:05] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:07:15] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1015 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:08:05] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:11:25] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:11:41] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:11:43] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:13] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1221 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:22] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1418 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:26] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1202 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:28] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1014 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:38] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:38] <shinken-wm>	 RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:38] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:12:50] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:13:04] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-generic-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:13:28] <shinken-wm>	 RECOVERY - Puppet run on tools-flannel-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:16:26] <shinken-wm>	 RECOVERY - Puppet run on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:16:48] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1416 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:25:11] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[16:25:13] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:41:31] <wikibugs>	 06Labs, 06Operations, 07Tracking: Sync data for tools-project from labstore1001 to labstore1004/5 - https://phabricator.wikimedia.org/T144255#2752383 (10madhuvishy) Started another sync now after truncating the >10G error/access log files from the above comment. New command (no >10G exclusion):  ``` rsync --...
[18:23:42] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 07Tracking: Packages to be installed in Tool Labs Kubernetes Images (Tracking) - https://phabricator.wikimedia.org/T140110#2752647 (10yuvipanda)
[18:23:44] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Install dependencies for python-lxml in python container - https://phabricator.wikimedia.org/T140117#2752644 (10yuvipanda) 05Open>03Resolved a:03yuvipanda This is done
[18:24:30] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Flannel is sometimes flaky - https://phabricator.wikimedia.org/T139707#2752651 (10yuvipanda) 05Open>03Resolved a:03yuvipanda This was caused by etcd nodes dying because of T140256 - is fine now.
[18:24:52] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 05Goal: Goal: Allow using k8s instead of GridEngine as a backend for webservices - https://phabricator.wikimedia.org/T129309#2752656 (10yuvipanda) 05Open>03Resolved a:03yuvipanda
[19:15:35] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs: python (python3 only) kubernetes image missing virtualenv command - https://phabricator.wikimedia.org/T149441#2752743 (10bd808)
[19:21:11] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs: python (python3 only) kubernetes image missing virtualenv command - https://phabricator.wikimedia.org/T149441#2752766 (10bd808) The `python3 -m venv` failure is an upstream bug in the Debian packages: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=732703
[19:26:33] <wikibugs>	 06Labs, 10Horizon, 13Patch-For-Review: Horizon dashboard for managing instance puppet config - https://phabricator.wikimedia.org/T91990#2752793 (10Andrew)
[19:26:35] <wikibugs>	 06Labs, 10Horizon: Figure out type coercion rules for puppet parameter config in horizon UI - https://phabricator.wikimedia.org/T137835#2752791 (10Andrew) 05Open>03Resolved My tests have all produced reasonable behavior so far, so I think we're good.
[19:36:49] <wikibugs>	 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 06Research-and-Data, 15User-bd808: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336#2752827 (10leila) @bd808 feel free to send a reminder that the survey will close in a week. You can also do this 3 days before the survey closes.
[19:48:51] <Krenair>	 !log tools.stewardbots I forgot to log this earlier: Found the bot was down, started the bot
[19:49:26] <Krenair>	 !log tools.stewardbots It got killed again... restarted
[20:01:38] <bd808>	 Krenair: ^ stashbot is back. The pod restarted for some reason and I had an error in the config file :/
[20:02:09] <Krenair>	 !log tools.stewardbots I forgot to log this earlier: Found the bot was down, started the bot
[20:02:10] <bd808>	 !log tools.statshbot Restarted bot that had crashed and wasn't self-starting due to syntax error in config
[20:02:10] <Krenair>	 !log tools.stewardbots It got killed again... restarted
[20:02:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[20:02:12] <stashbot>	 Unknown project "tools.statshbot"
[20:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[20:02:20] <bd808>	 !log tools.stashbot Restarted bot that had crashed and wasn't self-starting due to syntax error in config
[20:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[20:06:04] <yuvipanda>	 !log tools restart kube-apiserver again, ran into too many open file handles
[20:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:06:14] <yuvipanda>	 bd808: looks like it's hitting a 1024 open files limit
[20:06:20] <yuvipanda>	 bd808: gonna raise that
[20:06:43] <bd808>	 1024 is a tiny number of file handles for a server process
[20:07:00] <yuvipanda>	 yeah
[20:07:07] <yuvipanda>	 bd808: not sure where that comes from
[20:07:09] <yuvipanda>	 so tracking it down
[20:07:24] <bd808>	 it's stock for cli processes
[20:07:41] <bd808>	 something somewhere needs a ulimit in its startup
[20:15:35] <chasemp>	 !log restart prometheus service on tools-prometheus-01 to see if that wakes it up
[20:15:35] <stashbot>	 Unknown project "restart"
[20:15:40] <chasemp>	 !log tools restart prometheus service on tools-prometheus-01 to see if that wakes it up
[20:15:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:16:01] <chasemp>	 nope
[20:16:11] <chasemp>	 yeah I don't know enough about prometheus
[20:16:18] <yuvipanda>	 chasemp: you want prometheus blackbox exporter
[20:16:21] <yuvipanda>	 that's what does ssh / http checks
[20:16:25] <chasemp>	 I restarted it
[20:16:42] <yuvipanda>	 right
[20:16:44] <yuvipanda>	 the other option
[20:16:46] <yuvipanda>	 is that it's an iptables issue
[20:16:52] <yuvipanda>	 and the iptables rule that opens up holes
[20:16:59] <yuvipanda>	 for tools-prometheus-02 to hit ssh
[20:17:01] <yuvipanda>	 is failing somehow
[20:17:26] <yuvipanda>	 bd808: ok let's see how this goes. Raised the limit by a large number now
[20:18:31] <chasemp>	 yuvipanda: both tools-prometheus nodes can ssh to a vm in tools I see reported as failing
[20:18:33] <chasemp>	 telnet tools-exec-1404 22
[20:18:34] <chasemp>	 Trying 10.68.18.12...
[20:18:34] <chasemp>	 Connected to tools-exec-1404.tools.eqiad.wmflabs.
[20:18:39] <chasemp>	 thus confusion on my part
[20:18:42] <yuvipanda>	 chasemp: ah, ok
[20:18:43] <yuvipanda>	 right
[20:18:50] <yuvipanda>	 yeah looks like blackbox exporter having a bad time
[20:18:57] <yuvipanda>	 I'll check
[20:19:05] <yuvipanda>	 godog: ^ fyi (blackbox exporter freakin out)
[20:19:12] <chasemp>	 does it do a representative check or actually just look at ssh service?
[20:19:17] <chasemp>	 is it watching x to watch y etc
[20:19:31] <yuvipanda>	 it makes a network connection
[20:19:34] <yuvipanda>	 and checks for the ssh headers
[20:19:44] <chasemp>	 ok then yeah I'm confused
[20:20:12] <chasemp>	 all things being equal it should be succeeding
[20:20:17] <yuvipanda>	 chasemp: yeah
[20:22:05] <yuvipanda>	 yuvipanda@tools-prometheus-01:~$ curl 'localhost:9115/probe?target=tools-worker-1011.tools.eqiad.wmflabs&module=ssh_banner'
[20:22:07] <yuvipanda>	 returns 0
[20:22:10] <yuvipanda>	 should be returning 1
[20:23:35] <chasemp>	 yeah
[20:23:37] <chasemp>	 root@tools-prometheus-02:~# curl 'localhost:9115/probe?target=tools-worker-1014.tools.eqiad.wmflabs&module=ssh_banner'
[20:23:37] <chasemp>	 probe_duration_seconds 0.000010
[20:23:38] <chasemp>	 probe_success 0
[20:23:39] <chasemp>	 root@tools-prometheus-02:~# telnet tools-exec-1404 22
[20:23:41] <chasemp>	 Trying 10.68.18.12...
[20:23:43] <chasemp>	 Connected to tools-exec-1404.tools.eqiad.wmflabs.
[20:23:45] <chasemp>	 Escape character is '^]'
[20:23:49] <chasemp>	 works but doesn't work
[20:24:07] <yuvipanda>	 right
[20:24:16] <yuvipanda>	 chasemp: hmm trying to tcpdump
[20:24:32] <yuvipanda>	 chasemp: actually, PAWS is down, so I'm going to look into that first
[20:24:35] <chasemp>	 k
[20:41:50] <godog>	 yuvipanda: indeed, I'll take a look
[20:42:17] <yuvipanda>	 !log tools.paws stop all user containers
[20:42:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.paws/SAL
[20:42:27] <yuvipanda>	 I really need to finish up my kubernetes spawner refactoring and deploy it
[20:48:10] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: python (python3 only) kubernetes image missing virtualenv command - https://phabricator.wikimedia.org/T149441#2752961 (10bd808) 05Open>03Resolved a:03yuvipanda Verified that `python3 -m venv ...` works.
[20:48:53] <yuvipanda>	 bd808: thanks for confirming :D
[20:49:07] <bd808>	 tobias47n9e-c: ^ creating a virtualenv with `python -m venv path/to/create` should work now
[20:51:19] <godog>	 yuvipanda: looking with strace doesn't seem to even attempt to connect heh
[20:51:29] <yuvipanda>	 :(
[20:53:12] <chasemp>	 yuvipanda: seems you can bump dockers pull concurrency to help with layer management per max-concurrent-downloads (instead of sequential)
[20:53:22] <chasemp>	 wondering if that isn't germane to our pull timeout issue
[20:54:10] <yuvipanda>	 it isn't a pull timeout tho
[20:54:17] <yuvipanda>	 it is a pod start timeout
[20:54:24] <yuvipanda>	 caused by a pull
[20:54:39] <yuvipanda>	 don't think that'll affect us too much
[20:54:51] <yuvipanda>	 i should have a pre-puller done in a but :)
[20:55:40] <yuvipanda>	 chasemp: basically, i'll have an 'update images' script that uses clush to hit the builder, build image, then pull on the workers
[20:55:42] <chasemp>	 pre-puller is more savvy I think, but my idea was pod start timeouts are because of pull duration where we are capped on max layer pulling concurrency
[20:55:44] <yuvipanda>	 better than cron
[20:56:05] <yuvipanda>	 right that could be one cause
[20:56:21] <yuvipanda>	 and we can probably up it since our registry isn't overloaded
[20:56:28] <chasemp>	 that would also explain why we see it more as our images grow
[20:56:44] <chasemp>	 but that's also helped by pre-seeding
[20:56:45] <yuvipanda>	 they have had same number of layers tho
[20:57:08] <yuvipanda>	 only adding packages to same layers as before
[20:57:43] <chasemp>	 your thinking atm is when images are built to run something that prepopulates in the moment
[20:57:49] <chasemp>	 taking that op out of line w/ actual scheduling
[20:58:10] <chasemp>	 that was a question / clarifying statement :)
[20:58:32] <yuvipanda>	 yes
[20:58:37] <yuvipanda>	 push on buld
[20:58:40] <yuvipanda>	 *build
[20:58:46] <yuvipanda>	 not pull every 5 mins
[20:58:53] <chasemp>	 sounds good, that probaby solves 10 different issues we could face in one stroke
[20:59:02] <chasemp>	 and maximum consistency and predictability
[20:59:19] <yuvipanda>	 yup
[20:59:35] <yuvipanda>	 eventual consistency on this doesn' sound fun
[20:59:46] <yuvipanda>	 chasemp: i'll merge that clush patch later today too
[21:00:23] <chasemp>	 not to mention either all workers can host all base images or we have to segregate work loads anyhow so it's almost no cost 
[21:00:43] <chasemp>	 yuvipanda: k
[21:01:02] <yuvipanda>	 yeah
[21:01:07] <yuvipanda>	 exactly
[21:01:33] <chasemp>	 I'm with it, just making sure I understand :)
[21:01:48] <yuvipanda>	 yeah :)
[21:01:56] <yuvipanda>	 this needed clush root pers
[21:02:54] <yuvipanda>	 *perms
[21:14:00] <godog>	 yuvipanda: I have to afk for an errand for 45min or so, it seems like an issue with the binary itself though
[21:14:13] <yuvipanda>	 :(
[21:14:15] <godog>	 probably a good occasion to update it too, upstream has been developing more features
[21:14:21] <yuvipanda>	 not sure why it just popped up
[21:14:22] <yuvipanda>	 yeah
[21:14:39] <godog>	 my wild guess is not restarter after the latest upgrade?
[21:14:41] <godog>	 bbl
[21:18:15] <tobias47n9e-c>	 bd808: thanks. I got the venv to work. I am still working on the deployment. Hope to get it working this weekend and then I will update some of the docs.
[21:18:43] <bd808>	 tobias47n9e-c: awesome :)
[21:32:11] <dennyvrandecic_>	 curious, does labs track how often pages are being assessed, or would I need to implement my own tracking in case I would like to keep track?
[21:33:20] <hare>	 Tool Labs question: how do I activate the web server? Do I need to create a public_html folder?
[21:33:22] <bd808>	 dennyvrandecic_: we only have bulk measurements right now. No tracking of specific URLs
[21:33:42] <dennyvrandecic_>	 how bulky is bulk measurement?
[21:33:44] <bd808>	 hare: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web
[21:33:53] <hare>	 thank you
[21:34:06] <bd808>	 dennyvrandecic_: really bulk. http response codes per unit time
[21:34:26] <dennyvrandecic_>	 ah, so not even per project? gotcha, thanks
[21:34:29] <bd808>	 https://grafana-labs.wikimedia.org/dashboard/db/tools-activity
[21:35:24] <bd808>	 I would like to get something that tracks "popularity" by tool setup, but I have a lot of other things to get done that seem more important right now
[21:36:16] <mutante>	 like the "popcon" package in Debian?
[21:36:26] <bd808>	 graphite will be a poor match for storing that sort of timeseries data because of the number of tools
[21:37:43] <yuvipanda>	 dennyvrandecic_: bd808 it is per tool 
[21:37:56] <bd808>	 yuvipanda: it is?
[21:38:10] <yuvipanda>	 bd808: yeah https://grafana-labs.wikimedia.org/dashboard/db/kubernetes-tool-combined-stats
[21:38:22] <bd808>	 ah for k8s
[21:38:32] <yuvipanda>	 bd808: so that's a misnomer
[21:38:38] <yuvipanda>	 bd808: the web request tracking works for all tools
[21:38:48] <yuvipanda>	 bd808: it's just not exposed on a dashboard anywhere, I guess
[21:38:53] <yuvipanda>	 so might as well not exist, heh
[21:38:55] <yuvipanda>	 dennyvrandecic_: is your web tool php?
[21:39:03] <dennyvrandecic_>	 yuvipanda: plain html
[21:39:13] <yuvipanda>	 dennyvrandecic_: what is its name?
[21:39:13] <bd808>	 neat. I only knew about the tools.reqstats.combined.* metrics
[21:39:19] <dennyvrandecic_>	 with some js
[21:39:35] <yuvipanda>	 bd808: yeah, there's per tool stuff. that isn't http code aggregated tho, just overall counts
[21:39:47] <dennyvrandecic_>	 that's fine
[21:39:50] <bd808>	 *nod* that would be fine really
[21:39:53] <yuvipanda>	 dennyvrandecic_: what's the name of the tool? :)
[21:39:59] <dennyvrandecic_>	 everythingisconnected
[21:40:05] <yuvipanda>	 moment
[21:40:53] <yuvipanda>	 !log tools.everythingisconnected move to kubernetes, easier stats dashboard
[21:40:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.everythingisconnected/SAL
[21:41:08] <yuvipanda>	 dennyvrandecic_: it should show up in the dashboard dropdown at https://grafana-labs.wikimedia.org/dashboard/db/kubernetes-tool-combined-stats shortly
[21:41:23] <dennyvrandecic_>	 yuvipanda: that's awesome! thank you!
[21:41:43] <yuvipanda>	 !log tools.everythingisconnected move accomplished via webservice stop && webservice --backend=kubernetes start, which works for plain html / js (static) and php web applications
[21:41:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.everythingisconnected/SAL
[21:42:13] <bd808>	 dennyvrandecic_: you will need to remember to use `webservice --backend=kubernetes start` when and if you restart it
[21:42:28] <dennyvrandecic_>	 ok, but for now I don't have to restart it?
[21:42:35] <yuvipanda>	 bd808: not necessarily. we drop a bit in service.manifest that takes care of that
[21:42:36] <dennyvrandecic_>	 I'll write that down :)
[21:42:59] <bd808>	 yuvipanda: but webservice stops rm's the manifest does it not?
[21:43:06] <yuvipanda>	 bd808: ah yes, but restart doesn't
[21:43:12] * bd808 things we talked about this on a bug
[21:43:26] <yuvipanda>	 bd808: so if you do webservice stop for some reason, yes you'll have to do --backend=kubernetes
[21:43:32] <yuvipanda>	 bd808: yeah, I remember.
[21:43:49] <yuvipanda>	 dennyvrandecic_: but yeah, you shouldn't need to do even restarts - it's a plain html / js (static) app, so should be fine :)
[21:44:00] <yuvipanda>	 bd808: so many things to implement, etc :(
[21:44:15] <dennyvrandecic_>	 cool! thank you
[21:44:22] <dennyvrandecic_>	 I'll wait for it to appear in the dashboard :)
[21:44:33] <yuvipanda>	 dennyvrandecic_: :)
[21:44:44] <yuvipanda>	 bd808: we should also maybe make a only-webrequests dashboard
[21:44:59] <yuvipanda>	 shouldn't be too hard, it's all in graphite
[21:44:59] <bd808>	 as someone who is very used to `x restart` not actually working I'm kind of hard wired to do stop && start instead. I know that I should get over that for k8s because it does fancy things with restart
[21:45:26] <bd808>	 yuvipanda: we should. and really I'd like to wire that into striker
[21:45:52] <yuvipanda>	 bd808: yeah, you can just make a json request to graphite with tool name and get back data :D
[21:46:13] <yuvipanda>	 (brb)
[21:57:25] <wikibugs>	 10Striker: Add webservice traffic graphs to striker - https://phabricator.wikimedia.org/T149453#2753143 (10bd808)
[22:25:46] <abbe98[m]>	 Hi all,
[22:25:47] <abbe98[m]>	 As of a result of some really productive people in Norway, I believe that warper.wmflabs.org have been running out of disc space. I have reached out to the Warper admin, but is't possible to get more space quicker? The Norwegians wakes up in a few hours and it would be nice if the tool was working again by then...
[22:28:13] <paladox>	 yuvipanda ^^
[22:30:09] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 06Community-Tech-Tool-Labs: My first kubernetes + python3 + django app tutorial - https://phabricator.wikimedia.org/T149191#2753271 (10Tobias1984) I now used this for the `app.py`:   ``` import os  from django.core.wsgi import get_wsgi_application  os.environ.setdefa...
[22:38:54] <wikibugs>	 10Striker, 07Epic: Manage shared tool accounts via Striker - https://phabricator.wikimedia.org/T149458#2753308 (10bd808)
[22:40:28] <yuvipanda>	 abbe98: there isn't really much we as labs admins could do, sorry!
[22:43:28] <abbe98[m]>	 YuviPanda: Okey  thanks anyway!
[23:02:51] <wikibugs>	 10Striker, 07Epic, 05Goal: Manage shared tool accounts via Striker - https://phabricator.wikimedia.org/T149458#2753346 (10bd808)
[23:03:19] <wikibugs>	 10Striker, 06Community-Tech-Tool-Labs, 05Goal, 13Patch-For-Review, 15User-bd808: Create Wikitech/LDAP accounts via a new user friendly guided workflow - https://phabricator.wikimedia.org/T144710#2753347 (10bd808)
[23:48:47] <bd808>	 yuvipanda: there's an interesting question from tobias47n9e-c in https://phabricator.wikimedia.org/T149191#2753271
[23:49:07] <bd808>	 he's working on getting a django app running on py3 k8s
[23:49:13] <yuvipanda>	 bd808: ah, you can override uwsgi settings
[23:49:30] <yuvipanda>	 by putting an ini file in ~/uwsgi.ini
[23:49:34] <bd808>	 the static file path is the interesting part
[23:49:41] <yuvipanda>	 bd808: uwsgi has the ability to serve static files
[23:49:47] <bd808>	 ah. nice
[23:49:49] <yuvipanda>	 http://uwsgi-docs.readthedocs.io/en/latest/StaticFiles.html
[23:50:49] <yuvipanda>	 uwsgi has a ton of features, and you can use 'em all
[23:51:03] <yuvipanda>	 the app.py is also not a strict requirement, you can override it
[23:51:10] <yuvipanda>	 you can basically follow any 'uwsgi + django' tutorial
[23:54:21] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 06Community-Tech-Tool-Labs: My first kubernetes + python3 + django app tutorial - https://phabricator.wikimedia.org/T149191#2753468 (10bd808) @Tobias1984 To actually serve your static content check out https://uwsgi-docs.readthedocs.io/en/latest/StaticFiles.html and...
[23:55:34] <yuvipanda>	 bd808: the code will also need to know what the base url is - from looking at the way it was trying to hit /static, I think it thinks it's running on base
[23:55:35] <yuvipanda>	 err
[23:55:36] <yuvipanda>	  on /
[23:56:27] <bd808>	 yeah. there's a Django setting to tell it where your static files are really mounted. that should be trivial once he gets them being served
[23:58:41] <yuvipanda>	 yeah
[23:58:57] <yuvipanda>	 bd808: if you put static stuff in ~/www/static, they 'll also get served from tools-static.wmflabs.org/$toolname
[23:59:12] <yuvipanda>	 but that's different origin, which might (or might not!) have implications in your code
[23:59:21] <yuvipanda>	 it's technically faster (nginx -> NFS)
[23:59:33] <yuvipanda>	 shouldn't matter unless you are like serving massive files