[00:35:40] PROBLEM - Free space - all mounts on tools-worker-1019 is CRITICAL: CRITICAL: tools.tools-worker-1019.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found)tools.tools-worker-1019.diskspace.root.byte_percentfree (<40.00%) [00:39:03] ori: ah, the fun problems of gridengine :) [00:39:48] !log tools.morebots All bots shutdown. Stashbot is now handling !log messages in #countervandalism, #wikimedia-analytics, #wikimedia-fundraising, #wikimedia-labs, #wikimedia-operations, and #wikimedia-releng [00:39:49] Unknown project "tools.morebots" [00:40:06] uhh... [00:40:07] bd808: \o/ [00:40:31] what's with the unknown project message? [00:40:45] !log tools.stashbot now handling !log messages in #countervandalism, #wikimedia-analytics, #wikimedia-fundraising, #wikimedia-labs, #wikimedia-operations, and #wikimedia-releng [00:40:45] Unknown project "tools.stashbot" [00:40:48] frack [00:41:07] !log striker Test [00:41:07] Unknown project "striker" [00:41:33] ldap barf [00:41:46] sad_trombone.wav [00:41:48] Exception getting LDAP data for ou=projects,dc=wikimedia,dc=org [00:42:27] At least that doesn't crash the bot anymore ;) [00:43:41] the first error was "LDAPSessionTerminatedByServerError: session terminated by server" [00:44:29] I wonder if ldap3 can recover from that or if I need to explicitly create a new connection? [00:51:12] I thought we didn't use ou=projects,dc=wikimedia,dc=org anymore [00:57:28] Krenair: that's actually quite possible. I cribbed the LDAP code from admin bot quite a long time ago [00:57:45] !log tools.stashbot Is LDAP happy yet? [00:57:46] Unknown project "tools.stashbot" [00:57:50] frack [00:58:02] we have some task, hang on [00:58:14] here: https://phabricator.wikimedia.org/T138150 [00:59:07] right. 
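On bd808's question at 00:44:29 — whether ldap3 can recover from "session terminated by server" or needs an explicitly recreated connection: ldap3 does ship a RESTARTABLE client strategy that transparently reopens dropped connections; failing that, the generic fix is to rebuild the connection and retry once. A minimal sketch of that retry pattern in plain Python, where `make_connection` and `SessionTerminated` are hypothetical stand-ins for the real ldap3 connection factory and exception:

```python
# Generic "self-healing" wrapper: on a dropped-session error, rebuild the
# connection once and retry the query. make_connection() and the exception
# type are placeholders for the real ldap3 setup.
class SessionTerminated(Exception):
    pass

class SelfHealing:
    def __init__(self, make_connection):
        self._make = make_connection
        self._conn = make_connection()

    def query(self, fn):
        try:
            return fn(self._conn)
        except SessionTerminated:
            self._conn = self._make()   # server dropped us: reconnect
            return fn(self._conn)       # and retry exactly once
```

With ldap3 itself, the same effect should be obtainable without a wrapper by constructing the connection with `client_strategy=ldap3.RESTARTABLE`.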
I've looked at that recently because of the puppet cron spam complaint [00:59:35] so yeah I guess I need to change how the project name validation here is done and try to make the ldap connection self-healing [01:05:42] RECOVERY - Free space - all mounts on tools-worker-1019 is OK: OK: tools.tools-worker-1019.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found) [01:17:36] !log tools.stashbot Is LDAP happy yet? [01:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL [01:17:51] !log tools.stashbot now handling !log messages in #countervandalism, #wikimedia-analytics, #wikimedia-fundraising, #wikimedia-labs, #wikimedia-operations, and #wikimedia-releng [01:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL [01:18:05] !log tools.morebots All bots shutdown. Stashbot is now handling !log messages in #countervandalism, #wikimedia-analytics, #wikimedia-fundraising, #wikimedia-labs, #wikimedia-operations, and #wikimedia-releng [01:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.morebots/SAL [01:25:35] yuvipanda: it's also 2 fewer jobs running on precise :) [01:25:45] \o/ [01:25:50] and another pod on k8s :D [01:26:11] bd808: you can accurately track your pod's memory and CPU usage now [01:26:35] I hope its low [01:26:50] its a stream based message processor after all [01:27:29] bd808: https://grafana-labs.wikimedia.org/dashboard/db/kubernetes-tool-combined-stats?var-namespace=stashbot [01:28:02] I need to spend time on these dashboards too [01:28:04] I like the trend of this graph -- https://graphite-labs.wikimedia.org/render/?width=600&height=300&target=cactiStyle(sumSeries(tools.tools-services-01.sge.hosts.tools*12*.job_count))&target=cactiStyle(sumSeries(tools.tools-services-01.sge.hosts.tools*14*.job_count))&from=-3d [01:29:08] nice [02:10:03] 10Labs-project-Wikistats, 13Patch-For-Review: W3C wiki updates broken - 
https://phabricator.wikimedia.org/T149000#2750946 (10Dzahn) >>! In T149000#2748917, @hashar wrote: > Most probably because we list http and they have switched to https with the script not following redirects? :] That is exactly right, the... [02:11:21] 10Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2750950 (10Dzahn) [02:12:56] Krenair: for the project entries in LDAP, it sure looks to me like OSM creates cn=$NAME,ou=projects,dc=wikimedia,dc=org [02:18:35] the roles could be managed differently. That part is delegated to Nova in the OSM code [02:20:38] 10Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2750954 (10Dzahn) merged, deployed, ran update script table looks much better now: http://wikistats.wmflabs.org/display.php?t=w3 some of them became 404 meanwhile, and maybe there are new ones, dont know I tried usin... [02:20:45] 10Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2750956 (10Dzahn) 05Open>03Resolved [02:23:10] 10Labs-project-Wikistats: LXDE wiki updates broken - https://phabricator.wikimedia.org/T149396#2750958 (10Dzahn) [02:23:34] 10Labs-project-Wikistats: LXDE wiki updates broken - https://phabricator.wikimedia.org/T149396#2750973 (10Dzahn) [05:18:03] 06Labs, 10Adminbot: Get a cloak for morebots & labs-morebots - https://phabricator.wikimedia.org/T140547#2468752 (10bd808) Stashbot has taken over the `!log` duties in all channels for morebots and it does have a cloak. [05:18:19] 06Labs, 10Adminbot: Get a cloak for morebots & labs-morebots - https://phabricator.wikimedia.org/T140547#2751110 (10bd808) p:05Normal>03Low [05:52:09] 10Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2751134 (10RobiH) List of W3C Communities: https://www.w3.org/community/groups/ Add all that did not exist. Half of them are not wikis. By error Code, they can be deleted afterwards.
[05:55:41] 10Labs-project-Wikistats: LXDE wiki updates broken - https://phabricator.wikimedia.org/T149396#2751135 (10RobiH) Same here: Change http://wiki.lxde.org/en/api.php?action=query&meta=siteinfo&siprop=statistics&maxlag=5 into https://wiki.lxde.org/en/api.php?action=query&meta=siteinfo&siprop=statistics&maxlag=5&form... [06:34:11] PROBLEM - Puppet run on tools-exec-1221 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [07:14:11] RECOVERY - Puppet run on tools-exec-1221 is OK: OK: Less than 1.00% above the threshold [0.0] [07:40:28] (03PS2) 10Platonides: Add an "authenticate" command for identifying with nickserv after connection [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318229 (https://phabricator.wikimedia.org/T149265) [07:40:55] (03CR) 10MarcoAurelio: "Glaisher, mind having a look at this? Thanks." [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318229 (https://phabricator.wikimedia.org/T149265) (owner: 10Platonides) [07:41:29] 10Tool-Labs-tools-stewardbots, 13Patch-For-Review: StewardBot not logged into irc - https://phabricator.wikimedia.org/T149265#2751200 (10MarcoAurelio) p:05Triage>03Low [07:53:02] how come quickstatements are not listed at http://tools.wmflabs.org/?status [07:53:05] ? [07:53:22] they are hosted at tools: http://tools.wmflabs.org/wikidata-todo/quick_statements.php [07:55:42] It's hosted at wikidata-todo/ which is listed [08:02:51] 10Tool-Labs-tools-stewardbots: Evaluate cleanup on StewardBot's code - https://phabricator.wikimedia.org/T149404#2751215 (10MarcoAurelio) [08:02:55] 10Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2751228 (10hashar) @Dzahn apparently the wikistat script has support to fix such URLs! Reading `usr/lib/wikistats/update.php` one might be able to just: update.php w3 fixit That get from the database all the url... 
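Hashar's note that `update.php w3 fixit` can repair stored URLs, and RobiH's LXDE fix above, both amount to the same http→https rewrite. A rough Python sketch of what such a fixit pass does; `prefers_https` is a hypothetical probe (a real one would fetch the URL and check whether it redirects to https):

```python
def fix_scheme(url, prefers_https):
    """Rewrite an http:// API URL to https:// when the wiki now lives
    there; leave https:// (and non-redirecting) URLs untouched."""
    if url.startswith("http://") and prefers_https(url):
        return "https://" + url[len("http://"):]
    return url

# Example, assuming the probe reports the LXDE wiki redirects to https:
fix_scheme("http://wiki.lxde.org/en/api.php", lambda u: True)
```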
[08:05:01] 10Tool-Labs-tools-stewardbots: General update of HTML and CSS for stewardbot tools and portals - https://phabricator.wikimedia.org/T128745#2751233 (10MarcoAurelio) 05Open>03Resolved a:03MarcoAurelio Mainly done. I'll try to fix other minor issues when they appear. [08:05:04] 06Labs, 10Tool-Labs: Improve outcome of http://tools.wmflabs.org/?status proposal - https://phabricator.wikimedia.org/T149405#2751237 (10Wesalius) [08:05:34] abbe98[m]: thanks [08:06:49] 10Tool-Labs-tools-stewardbots, 10Continuous-Integration-Config, 13Patch-For-Review: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#2751251 (10MarcoAurelio) Okay, so, to summarize, we have PHP lint tests running. We still lack (IMO): * python checks * and maybe... [08:07:04] 10Tool-Labs-tools-stewardbots, 10Continuous-Integration-Config: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#2751252 (10MarcoAurelio) [09:34:29] Hi! Running a script on the labs instance dwl I get the error '35 SSL connect error. The SSL handshaking failed.' so about one time the hour. What is the reason why? [09:41:57] jynus: can you say something to this ^ [09:47:09] valhallasw`vecto: Hi! Running a script on the labs instance dwl I get the error '35 SSL connect error. The SSL handshaking failed.' so about one time the hour. What is the reason why? [09:49:24] Doctaxon: connecting to where? Does curl/wget work? Is it just from that host or from all labs hosts or also from home, etc.... [09:50:42] connecting to where? Ahm, API [09:51:03] sorry, i am not very familiar with that [10:00:03] valhallasw`vecto: can you tell me first, what an error that is? SSL handshaking? [10:09:27] valhallasw`vecto: is it possible that the session key gets lost or changes during data transfer between client and server? [10:18:18] > Puppet is failing to run on the "dumps-stats.dumps.eqiad.wmflabs" instance in Wikimedia Labs.
[10:22:14] (03PS1) 10Gehel: fixed default configuration for maps / postgresql [labs/private] - 10https://gerrit.wikimedia.org/r/318516 [10:35:29] 10Tool-Labs-tools-stewardbots, 10Continuous-Integration-Config: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#2751559 (10hashar) The CI definition in integration/config.git zuul/layout.yaml is: ``` - name: labs/tools/stewardbots template: - name: comp... [10:39:27] (03CR) 10Gehel: [C: 032 V: 032] fixed default configuration for maps / postgresql [labs/private] - 10https://gerrit.wikimedia.org/r/318516 (owner: 10Gehel) [10:41:06] Doctaxon: i don't know? Something with ssl not being able to set up the connection [10:41:51] So either something with your vm, something with the network or something with the server you're connecting to... [10:42:06] the connection is set up, but it breaks after an hour or two [10:42:42] i am connecting to the api servers [10:42:52] Handshake is the initial phase of the connection [10:43:13] So it's not set up (anymore?) at that point [10:43:34] yes, but I get the error during the connection, it breaks with this error report [10:44:24] So apparently you're reconnecting and that fails? [10:44:46] making an api interrogation [10:47:33] the scripts makes some api interrogations to get some data, and then immediately the error 35 occurs and the script breaks [10:57:35] So debug what your http stack is doing...? 
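Curl's exit code 35 ("SSL connect error") means the TLS handshake itself failed, so one way to follow valhallasw's advice above is to reproduce the bare handshake outside the script, from the failing instance, and compare against other hosts and times. A sketch using Python's stdlib `ssl` module (the target host would be whichever API server the script talks to):

```python
import socket
import ssl

def probe_handshake(host, port=443, timeout=10):
    """Attempt a bare TLS handshake; return the negotiated protocol
    version on success, or the raised error on failure."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version()    # e.g. "TLSv1.2"
    except (ssl.SSLError, OSError) as exc:
        return repr(exc)
```

Running this in a loop from the dwl instance would show whether the handshake failure recurs hourly on its own, independent of the script's HTTP stack.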
[11:36:31] (03PS1) 10Hashar: Introduce tox + flake8 [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318521 (https://phabricator.wikimedia.org/T128503) [11:36:52] (03CR) 10Hashar: "check experimental" [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318521 (https://phabricator.wikimedia.org/T128503) (owner: 10Hashar) [11:37:59] 10Tool-Labs-tools-stewardbots, 10Continuous-Integration-Config, 13Patch-For-Review: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#2751642 (10hashar) 05stalled>03Open https://gerrit.wikimedia.org/r/318521 is a first pass and should be a good base to build up... [12:45:55] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [12:53:36] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170) [13:42:52] 06Labs, 10Labs-Infrastructure, 10DBA, 13Patch-For-Review: Initial setup and provision of labsdb1009, labsdb1010 and labsdb1011 - https://phabricator.wikimedia.org/T140452#2751913 (10jcrespo) [13:43:44] 06Labs, 10Labs-Infrastructure, 10DBA, 13Patch-For-Review: Initial setup and provision of labsdb1009, labsdb1010 and labsdb1011 - https://phabricator.wikimedia.org/T140452#2465396 (10jcrespo) [13:43:47] 06Labs, 10Labs-Infrastructure, 10DBA: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T142807#2751915 (10jcrespo) [13:44:34] 06Labs, 10Labs-Infrastructure, 10DBA, 07Epic, 07Tracking: Labs databases rearchitecture (tracking) - https://phabricator.wikimedia.org/T140788#2751919 (10jcrespo) [13:44:37] 06Labs, 10Labs-Infrastructure, 10DBA: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T142807#2546917 (10jcrespo) [14:13:55] 06Labs, 10Horizon, 13Patch-For-Review: Add a 'remember me' feature to Horizon - https://phabricator.wikimedia.org/T149036#2752010 (10Andrew) 05Open>03Resolved [14:14:25] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 
0%, RTA = 200.52 ms [14:17:18] 06Labs, 10Labs-Infrastructure, 10DBA: Provision with data the new labsdb servers and provide replica service with at least 1 shard from a sanitized copy from production - https://phabricator.wikimedia.org/T147052#2752030 (10jcrespo) [14:17:43] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22) [14:26:12] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [14:28:30] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218) [14:34:37] 06Labs: Request increased quota for services-test labs project - https://phabricator.wikimedia.org/T148869#2752062 (10Eevans) >>! In T148869#2738712, @Andrew wrote: > OK, quotas are increased. Please make a note on this ticket when you're done debugging and I'll move the quotas back to their previous values: >... [14:43:10] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Reinhard Kraasch was created, changed by Reinhard Kraasch link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Reinhard_Kraasch edit summary: Created page with "{{Tools Access Request |Justification=I want to migrate some of my bot tasks to tool labs ([[:de:User:RKBot]] and [[:commons:User:RKBot]]) |Completed=false |User Name=Reinha..." 
[15:06:27] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Reinhard Kraasch was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=936626 edit summary: [15:23:25] RECOVERY - Free space - all mounts on tools-docker-builder-01 is OK: OK: tools.tools-docker-builder-01.diskspace.root.byte_percentfree (More than half of the datapoints are undefined) tools.tools-docker-builder-01.diskspace._srv.byte_percentfree (More than half of the datapoints are undefined) [15:23:36] 06Labs, 06Operations, 07Tracking: Sync data for tools-project from labstore1001 to labstore1004/5 - https://phabricator.wikimedia.org/T144255#2752201 (10chasemp) @madhuvishy thoughts on truncating the disposal >10G files and kicking off an update of rsync over the weekend w/ the tree largest excluded for no... [15:25:07] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:25:08] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:25:16] name resolution is not working on tools, in case it's not known [15:25:17] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:25:17] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:25:18] PROBLEM - Puppet run on tools-exec-1203 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:25:18] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:26:30] PROBLEM - Puppet run on tools-exec-1420 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:26:30] PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0] [15:26:31] PROBLEM - Puppet run on 
tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [0.0] [15:26:39] PROBLEM - Puppet run on tools-docker-registry-01 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [0.0] [15:26:41] PROBLEM - Puppet run on tools-docker-builder-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:26:47] PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0] [15:26:48] PROBLEM - Puppet run on tools-exec-1405 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0] [15:26:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0] [15:26:51] PROBLEM - Puppet run on tools-proxy-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [15:26:52] PROBLEM - Puppet run on tools-exec-1205 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0] [15:26:55] PROBLEM - Puppet run on tools-exec-1204 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0] [15:26:57] PROBLEM - Puppet run on tools-worker-1005 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0] [15:27:00] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:27:00] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:27:00] PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:27:02] PROBLEM - Puppet run on tools-worker-1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:27:06] PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0] [15:27:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 
100.00% of data above the critical threshold [0.0] [15:27:06] PROBLEM - Puppet run on tools-elastic-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:27:07] PROBLEM - Puppet run on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [15:27:11] ori: I think everything is expected to be messed up while the reboots are ongoing [15:27:12] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [15:27:14] PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:27:16] PROBLEM - Host ToolLabs is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [15:27:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [15:27:38] PROBLEM - Puppet run on tools-exec-1218 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [15:28:37] madhuvishy: can you change topic here for maint? [15:28:39] I can't seem to tod it [15:28:45] to do it even [15:28:53] bah. 
need to +o [15:29:36] sorry, but shinken-wm missed the internal-server-nat.wmflabs.org hostmask, and 208.80.155.255 instead, so Sigyn was triggered [15:30:23] PROBLEM - Puppet run on tools-worker-1022 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [15:30:35] thanks madhuvishy [15:30:48] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:31:14] PROBLEM - Puppet run on tools-worker-1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:31:15] PROBLEM - Puppet run on tools-worker-1019 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:31:17] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:31:23] PROBLEM - Puppet run on tools-logs-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:31:25] PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [15:31:25] PROBLEM - Puppet run on tools-static-10 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [15:31:35] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:31:37] PROBLEM - Puppet run on tools-exec-1419 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:31:39] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:31:41] ori: in theory everything is ok now, tho we aren't out of maint period as defined yet just fyi [15:31:41] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:31:42] PROBLEM - Puppet run on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:32:14] PROBLEM - Puppet run on tools-worker-1015 is 
CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:32:22] PROBLEM - Puppet run on tools-exec-1202 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [15:32:29] PROBLEM - Puppet run on tools-exec-1211 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:32:29] PROBLEM - Puppet run on tools-worker-1014 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0] [15:32:37] PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [15:32:37] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [15:32:40] PROBLEM - Puppet run on tools-exec-1213 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:32:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1405 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:33:04] PROBLEM - Puppet run on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:33:04] PROBLEM - Puppet run on tools-exec-1413 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:35:44] RECOVERY - Host ToolLabs is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [15:36:48] PROBLEM - Puppet run on tools-exec-1416 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:37:12] PROBLEM - Puppet run on tools-exec-1221 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:37:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1418 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:37:23] I apparently don't have enough rights to remove the f'ing topiclock [15:38:32] like that? 
[15:38:35] bd808 [15:38:40] :) thanks [15:38:46] you're already opped [15:38:52] * bd808 has weak irc fu [15:39:02] so it should just be a case of /mode #wikimedia-labs -t [15:39:24] I was trying "set #wikimedia-labs topiclock off" with chanserv [15:39:33] I guess +t is not the same thing [15:40:21] I don't think topiclock is on in this channel? [15:40:39] *nod* just me being confused [15:41:17] hmm. what prompted you to look at topiclock stuff? [15:41:41] the fact that only ops could change the status [15:41:55] like I said, weak irc fu [15:42:19] oh yeah, that's controlled by +t [15:42:30] ok [15:43:15] RECOVERY - Puppet run on tools-flannel-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:43:37] !log tools restart toolschecker service on 01 and 02 [15:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:45:43] RECOVERY - Puppet run on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:45:47] RECOVERY - Puppet run on tools-exec-1414 is OK: OK: Less than 1.00% above the threshold [0.0] [15:46:01] RECOVERY - Puppet run on tools-exec-1417 is OK: OK: Less than 1.00% above the threshold [0.0] [15:47:09] RECOVERY - Puppet run on tools-prometheus-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:47:13] RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [15:47:31] RECOVERY - Puppet run on tools-worker-1023 is OK: OK: Less than 1.00% above the threshold [0.0] [15:47:46] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [15:47:54] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [15:48:24] RECOVERY - Puppet run on tools-webgrid-lighttpd-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [15:51:10] RECOVERY - Puppet run on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [15:51:16] RECOVERY - Puppet run on tools-exec-1410 is OK: 
OK: Less than 1.00% above the threshold [0.0] [15:51:20] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:51:34] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:51:50] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0] [15:52:01] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [15:52:43] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:53:21] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:07] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:19] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:21] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:27] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:27] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:29] RECOVERY - Puppet run on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:35] RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:41] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:49] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [15:55:51] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:11] RECOVERY - Puppet run on tools-grid-master is OK: OK: Less 
than 1.00% above the threshold [0.0] [15:56:26] RECOVERY - Puppet run on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:28] RECOVERY - Puppet run on tools-exec-1420 is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:38] RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:42] RECOVERY - Puppet run on tools-docker-builder-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:48] RECOVERY - Puppet run on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:50] RECOVERY - Puppet run on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:52] RECOVERY - Puppet run on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:54] RECOVERY - Puppet run on tools-worker-1006 is OK: OK: Less than 1.00% above the threshold [0.0] [15:56:58] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [15:57:04] RECOVERY - Puppet run on tools-worker-1020 is OK: OK: Less than 1.00% above the threshold [0.0] [15:57:06] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [15:58:32] !log tools restart k8s master, seems to have run out of fds [15:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:00:20] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [16:00:21] RECOVERY - Puppet run on tools-worker-1008 is OK: OK: Less than 1.00% above the threshold [0.0] [16:00:26] RECOVERY - Puppet run on tools-cron-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:00:31] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0] [16:00:37] RECOVERY - Puppet run on tools-exec-1418 is OK: OK: Less than 1.00% above the threshold [0.0] [16:00:39] RECOVERY - Puppet run on tools-worker-1021 is 
OK: OK: Less than 1.00% above the threshold [0.0] [16:00:49] RECOVERY - Puppet run on tools-webgrid-lighttpd-1415 is OK: OK: Less than 1.00% above the threshold [0.0] [16:00:49] RECOVERY - Puppet run on tools-worker-1010 is OK: OK: Less than 1.00% above the threshold [0.0] [16:01:13] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [16:01:17] RECOVERY - Puppet run on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0] [16:01:39] RECOVERY - Puppet run on tools-docker-registry-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:01:41] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [16:01:48] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [16:01:57] RECOVERY - Puppet run on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0] [16:01:59] RECOVERY - Puppet run on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [16:02:01] RECOVERY - Puppet run on tools-k8s-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:02:03] RECOVERY - Puppet run on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:02:03] RECOVERY - Puppet run on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:02:13] RECOVERY - Puppet run on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0] [16:05:16] RECOVERY - Puppet run on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [16:05:24] RECOVERY - Puppet run on tools-worker-1022 is OK: OK: Less than 1.00% above the threshold [0.0] [16:05:32] RECOVERY - Puppet run on tools-redis-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [16:05:58] RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 
1.00% above the threshold [0.0] [16:06:14] RECOVERY - Puppet run on tools-worker-1004 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:16] RECOVERY - Puppet run on tools-worker-1019 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:16] RECOVERY - Puppet run on tools-exec-1415 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:17] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:24] RECOVERY - Puppet run on tools-logs-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:28] RECOVERY - Puppet run on tools-static-10 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:29] RECOVERY - Puppet run on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:30] RECOVERY - Puppet run on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:46] RECOVERY - Puppet run on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:49] RECOVERY - Puppet run on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:51] RECOVERY - Puppet run on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:51] RECOVERY - Puppet run on tools-webgrid-lighttpd-1404 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:51] RECOVERY - Puppet run on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:55] RECOVERY - Puppet run on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0] [16:07:05] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [16:07:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [16:07:15] RECOVERY - Puppet run on tools-worker-1015 is OK: OK: Less than 1.00% above the threshold [0.0] [16:08:05] RECOVERY - Puppet run on tools-exec-1413 is OK: OK: Less than 1.00% above the threshold [0.0] [16:11:25] 
RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [16:11:41] RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [16:11:43] RECOVERY - Puppet run on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0] [16:12:13] RECOVERY - Puppet run on tools-exec-1221 is OK: OK: Less than 1.00% above the threshold [0.0] [16:12:22] RECOVERY - Puppet run on tools-webgrid-lighttpd-1418 is OK: OK: Less than 1.00% above the threshold [0.0] [16:12:26] RECOVERY - Puppet run on tools-exec-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [16:12:28] RECOVERY - Puppet run on tools-worker-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [16:12:38] RECOVERY - Puppet run on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [0.0] [16:12:38] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0] [16:12:38] RECOVERY - Puppet run on tools-worker-1003 is OK: OK: Less than 1.00% above the threshold [0.0] [16:12:50] RECOVERY - Puppet run on tools-webgrid-lighttpd-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [16:13:04] RECOVERY - Puppet run on tools-webgrid-generic-1404 is OK: OK: Less than 1.00% above the threshold [0.0] [16:13:28] RECOVERY - Puppet run on tools-flannel-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:16:26] RECOVERY - Puppet run on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0] [16:16:48] RECOVERY - Puppet run on tools-exec-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [16:25:11] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [16:25:13] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0] [16:41:31] 06Labs, 06Operations, 07Tracking: Sync data for tools-project from labstore1001 to labstore1004/5 - https://phabricator.wikimedia.org/T144255#2752383 
(10madhuvishy) Started another sync now after truncating the >10G error/access log files from the above comment. New command (no >10G exclusion): ``` rsync --... [18:23:42] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 07Tracking: Packages to be installed in Tool Labs Kubernetes Images (Tracking) - https://phabricator.wikimedia.org/T140110#2752647 (10yuvipanda) [18:23:44] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Install dependencies for python-lxml in python container - https://phabricator.wikimedia.org/T140117#2752644 (10yuvipanda) 05Open>03Resolved a:03yuvipanda This is done [18:24:30] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Flannel is sometimes flaky - https://phabricator.wikimedia.org/T139707#2752651 (10yuvipanda) 05Open>03Resolved a:03yuvipanda This was caused by etcd nodes dying because of T140256 - is fine now. [18:24:52] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 05Goal: Goal: Allow using k8s instead of GridEngine as a backend for webservices - https://phabricator.wikimedia.org/T129309#2752656 (10yuvipanda) 05Open>03Resolved a:03yuvipanda [19:15:35] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: python (python3 only) kubernetes image missing virtualenv command - https://phabricator.wikimedia.org/T149441#2752743 (10bd808) [19:21:11] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: python (python3 only) kubernetes image missing virtualenv command - https://phabricator.wikimedia.org/T149441#2752766 (10bd808) The `python3 -m venv` failure is an upstream bug in the Debian packages: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=732703 [19:26:33] 06Labs, 10Horizon, 13Patch-For-Review: Horizon dashboard for managing instance puppet config - https://phabricator.wikimedia.org/T91990#2752793 (10Andrew) [19:26:35] 06Labs, 10Horizon: Figure out type coercion rules for puppet parameter config in horizon UI - https://phabricator.wikimedia.org/T137835#2752791 (10Andrew) 05Open>03Resolved My tests have all produced reasonable behavior so far, so I think we're good. 
[19:36:49] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 06Research-and-Data, 15User-bd808: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336#2752827 (10leila) @bd808 feel free to send a reminder that the survey will close in a week. You can also do this 3 days before the survey closes. [19:48:51] !log tools.stewardbots I forgot to log this earlier: Found the bot was down, started the bot [19:49:26] !log tools.stewardbots It got killed again... restarted [20:01:38] Krenair: ^ stashbot is back. The pod restarted for some reason and I had an error in the config file :/ [20:02:09] !log tools.stewardbots I forgot to log this earlier: Found the bot was down, started the bot [20:02:10] !log tools.statshbot Restarted bot that had crashed and wasn't self-starting due to syntax error in config [20:02:10] !log tools.stewardbots It got killed again... restarted [20:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL [20:02:12] Unknown project "tools.statshbot" [20:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL [20:02:20] !log tools.stashbot Restarted bot that had crashed and wasn't self-starting due to syntax error in config [20:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL [20:06:04] !log tools restart kube-apiserver again, ran into too many open file handles [20:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [20:06:14] bd808: looks like it's hitting a 1024 open files limit [20:06:20] bd808: gonna raise that [20:06:43] 1024 is a tiny number of file handles for a server process [20:07:00] yeah [20:07:07] bd808: not sure where that comes from [20:07:09] so tracking it down [20:07:24] it's stock for cli processes [20:07:41] something somewhere needs a ulimit in its startup [20:15:35] !log restart prometheus service on tools-prometheus-01 to see if that wakes it 
up [20:15:35] Unknown project "restart" [20:15:40] !log tools restart prometheus service on tools-prometheus-01 to see if that wakes it up [20:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [20:16:01] nope [20:16:11] yeah I don't know enough about prometheus [20:16:18] chasemp: you want prometheus blackbox exporter [20:16:21] that's what does ssh / http checks [20:16:25] I restarted it [20:16:42] right [20:16:44] the other option [20:16:46] is that it's an iptables issue [20:16:52] and the iptables rule that opens up holes [20:16:59] for tools-prometheus-02 to hit ssh [20:17:01] is failing somehow [20:17:26] bd808: ok let's see how this goes. Raised the limit by a large number now [20:18:31] yuvipanda: both tools-prometheus nodes can ssh to a vm in tools I see reported as failing [20:18:33] telnet tools-exec-1404 22 [20:18:34] Trying 10.68.18.12... [20:18:34] Connected to tools-exec-1404.tools.eqiad.wmflabs. [20:18:39] thus confusion on my part [20:18:42] chasemp: ah, ok [20:18:43] right [20:18:50] yeah looks like blackbox exporter having a bad time [20:18:57] I'll check [20:19:05] godog: ^ fyi (blackbox exporter freakin out) [20:19:12] does it do a representative check or actually just look at ssh service? 
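On the kube-apiserver open-files ceiling discussed at 20:06 above: 1024 is the stock soft `ulimit -n` for CLI processes, and a daemon inherits it unless its unit raises it. A sketch of the usual systemd fix, assuming the unit is named `kube-apiserver.service` (the limit value is illustrative, not what was actually deployed):

```ini
# /etc/systemd/system/kube-apiserver.service.d/limits.conf
# Raise the per-process open-file limit for the API server.
[Service]
LimitNOFILE=65536
```

Apply with `systemctl daemon-reload && systemctl restart kube-apiserver`, then verify with `grep 'open files' /proc/$(pidof kube-apiserver)/limits`.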
[20:19:17] is it watching x to watch y etc [20:19:31] it makes a network connection [20:19:34] and checks for the ssh headers [20:19:44] ok then yeah I'm confused [20:20:12] all things being equal it should be succeeding [20:20:17] chasemp: yeah [20:22:05] yuvipanda@tools-prometheus-01:~$ curl 'localhost:9115/probe?target=tools-worker-1011.tools.eqiad.wmflabs&module=ssh_banner' [20:22:07] returns 0 [20:22:10] should be returning 1 [20:23:35] yeah [20:23:37] root@tools-prometheus-02:~# curl 'localhost:9115/probe?target=tools-worker-1014.tools.eqiad.wmflabs&module=ssh_banner' [20:23:37] probe_duration_seconds 0.000010 [20:23:38] probe_success 0 [20:23:39] root@tools-prometheus-02:~# telnet tools-exec-1404 22 [20:23:41] Trying 10.68.18.12... [20:23:43] Connected to tools-exec-1404.tools.eqiad.wmflabs. [20:23:45] Escape character is '^]' [20:23:49] works but doesn't work [20:24:07] right [20:24:16] chasemp: hmm trying to tcpdump [20:24:32] chasemp: actually, PAWS is down, so I'm going to look into that first [20:24:35] k [20:41:50] yuvipanda: indeed, I'll take a look [20:42:17] !log tools.paws stop all user containers [20:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.paws/SAL [20:42:27] I really need to finish up my kubernetes spawner refactoring and deploy it [20:48:10] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: python (python3 only) kubernetes image missing virtualenv command - https://phabricator.wikimedia.org/T149441#2752961 (10bd808) 05Open>03Resolved a:03yuvipanda Verified that `python3 -m venv ...` works. 
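The `python3 -m venv` fix verified above can also be exercised from Python itself via the stdlib `venv` module. A minimal sketch; the target is a throwaway temp directory, and `with_pip=False` sidesteps the `ensurepip` step that the Debian bug referenced at 19:21 broke (use `with_pip=True` to exercise the previously broken path):

```python
# Minimal check that stdlib venv creation works, equivalent to `python3 -m venv`.
import pathlib
import tempfile
import venv

target = pathlib.Path(tempfile.mkdtemp()) / "demo-venv"
venv.create(target, with_pip=False)  # with_pip=False skips ensurepip
print((target / "bin" / "python").exists())  # → True on POSIX systems
```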
[20:48:53] bd808: thanks for confirming :D [20:49:07] tobias47n9e-c: ^ creating a virtualenv with `python -m venv path/to/create` should work now [20:51:19] yuvipanda: looking with strace doesn't seem to even attempt to connect heh [20:51:29] :( [20:53:12] yuvipanda: seems you can bump docker's pull concurrency to help with layer management per max-concurrent-downloads (instead of sequential) [20:53:22] wondering if that isn't germane to our pull timeout issue [20:54:10] it isn't a pull timeout tho [20:54:17] it is a pod start timeout [20:54:24] caused by a pull [20:54:39] don't think that'll affect us too much [20:54:51] i should have a pre-puller done in a bit :) [20:55:40] chasemp: basically, i'll have an 'update images' script that uses clush to hit the builder, build image, then pull on the workers [20:55:42] pre-puller is more savvy I think, but my idea was pod start timeouts are because of pull duration where we are capped on max layer pulling concurrency [20:55:44] better than cron [20:56:05] right that could be one cause [20:56:21] and we can probably up it since our registry isn't overloaded [20:56:28] that would also explain why we see it more as our images grow [20:56:44] but that's also helped by pre-seeding [20:56:45] they have had same number of layers tho [20:57:08] only adding packages to same layers as before [20:57:43] your thinking atm is when images are built to run something that prepopulates in the moment [20:57:49] taking that op out of line w/ actual scheduling [20:58:10] that was a question / clarifying statement :) [20:58:32] yes [20:58:37] push on buld [20:58:40] *build [20:58:46] not pull every 5 mins [20:58:53] sounds good, that probably solves 10 different issues we could face in one stroke [20:59:02] and maximum consistency and predictability [20:59:19] yup [20:59:35] eventual consistency on this doesn't sound fun [20:59:46] chasemp: i'll merge that clush patch later today too [21:00:23] not to mention either all workers can host all
base images or we have to segregate workloads anyhow so it's almost no cost [21:00:43] yuvipanda: k [21:01:02] yeah [21:01:07] exactly [21:01:33] I'm with it, just making sure I understand :) [21:01:48] yeah :) [21:01:56] this needed clush root pers [21:02:54] *perms [21:14:00] yuvipanda: I have to afk for an errand for 45min or so, it seems like an issue with the binary itself though [21:14:13] :( [21:14:15] probably a good occasion to update it too, upstream has been developing more features [21:14:21] not sure why it just popped up [21:14:22] yeah [21:14:39] my wild guess is it was not restarted after the latest upgrade? [21:14:41] bbl [21:18:15] bd808: thanks. I got the venv to work. I am still working on the deployment. Hope to get it working this weekend and then I will update some of the docs. [21:18:43] tobias47n9e-c: awesome :) [21:32:11] curious, does labs track how often pages are being assessed, or would I need to implement my own tracking in case I would like to keep track? [21:33:20] Tool Labs question: how do I activate the web server? Do I need to create a public_html folder? [21:33:22] dennyvrandecic_: we only have bulk measurements right now. No tracking of specific URLs [21:33:42] how bulky is bulk measurement? [21:33:44] hare: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web [21:33:53] thank you [21:34:06] dennyvrandecic_: really bulk. http response codes per unit time [21:34:26] ah, so not even per project? gotcha, thanks [21:34:29] https://grafana-labs.wikimedia.org/dashboard/db/tools-activity [21:35:24] I would like to get something that tracks "popularity" by tool setup, but I have a lot of other things to get done that seem more important right now [21:36:16] like the "popcon" package in Debian? [21:36:26] graphite will be a poor match for storing that sort of timeseries data because of the number of tools [21:37:43] dennyvrandecic_: bd808 it is per tool [21:37:56] yuvipanda: it is?
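Returning to the pull-concurrency point at 20:53: `max-concurrent-downloads` is a real dockerd option (it defaults to 3 concurrent layer downloads per pull) and can be raised in `/etc/docker/daemon.json`; the value below is illustrative:

```json
{
  "max-concurrent-downloads": 6
}
```

A dockerd restart (`systemctl restart docker`) is needed for the change to take effect.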
[21:38:10] bd808: yeah https://grafana-labs.wikimedia.org/dashboard/db/kubernetes-tool-combined-stats [21:38:22] ah for k8s [21:38:32] bd808: so that's a misnomer [21:38:38] bd808: the web request tracking works for all tools [21:38:48] bd808: it's just not exposed on a dashboard anywhere, I guess [21:38:53] so might as well not exist, heh [21:38:55] dennyvrandecic_: is your web tool php? [21:39:03] yuvipanda: plain html [21:39:13] dennyvrandecic_: what is its name? [21:39:13] neat. I only knew about the tools.reqstats.combined.* metrics [21:39:19] with some js [21:39:35] bd808: yeah, there's per tool stuff. that isn't http code aggregated tho, just overall counts [21:39:47] that's fine [21:39:50] *nod* that would be fine really [21:39:53] dennyvrandecic_: what's the name of the tool? :) [21:39:59] everythingisconnected [21:40:05] moment [21:40:53] !log tools.everythingisconnected move to kubernetes, easier stats dashboard [21:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.everythingisconnected/SAL [21:41:08] dennyvrandecic_: it should show up in the dashboard dropdown at https://grafana-labs.wikimedia.org/dashboard/db/kubernetes-tool-combined-stats shortly [21:41:23] yuvipanda: that's awesome! thank you! [21:41:43] !log tools.everythingisconnected move accomplished via webservice stop && webservice --backend=kubernetes start, which works for plain html / js (static) and php web applications [21:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.everythingisconnected/SAL [21:42:13] dennyvrandecic_: you will need to remember to use `webservice --backend=kubernetes start` when and if you restart it [21:42:28] ok, but for now I don't have to restart it? [21:42:35] bd808: not necessarily. we drop a bit in service.manifest that takes care of that [21:42:36] I'll write that down :) [21:42:59] yuvipanda: but webservice stops rm's the manifest does it not? 
[21:43:06] bd808: ah yes, but restart doesn't [21:43:12] * bd808 thinks we talked about this on a bug [21:43:26] bd808: so if you do webservice stop for some reason, yes you'll have to do --backend=kubernetes [21:43:32] bd808: yeah, I remember. [21:43:49] dennyvrandecic_: but yeah, you shouldn't need to do even restarts - it's a plain html / js (static) app, so should be fine :) [21:44:00] bd808: so many things to implement, etc :( [21:44:15] cool! thank you [21:44:22] I'll wait for it to appear in the dashboard :) [21:44:33] dennyvrandecic_: :) [21:44:44] bd808: we should also maybe make an only-webrequests dashboard [21:44:59] shouldn't be too hard, it's all in graphite [21:44:59] as someone who is very used to `x restart` not actually working I'm kind of hard wired to do stop && start instead. I know that I should get over that for k8s because it does fancy things with restart [21:45:26] yuvipanda: we should. and really I'd like to wire that into striker [21:45:52] bd808: yeah, you can just make a json request to graphite with tool name and get back data :D [21:46:13] (brb) [21:57:25] 10Striker: Add webservice traffic graphs to striker - https://phabricator.wikimedia.org/T149453#2753143 (10bd808) [22:25:46] Hi all, [22:25:47] As a result of some really productive people in Norway, I believe that warper.wmflabs.org has been running out of disk space. I have reached out to the Warper admin, but is it possible to get more space quicker? The Norwegians wake up in a few hours and it would be nice if the tool was working again by then... [22:28:13] yuvipanda ^^ [22:30:09] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 06Community-Tech-Tool-Labs: My first kubernetes + python3 + django app tutorial - https://phabricator.wikimedia.org/T149191#2753271 (10Tobias1984) I now used this for the `app.py`: ``` import os from django.core.wsgi import get_wsgi_application os.environ.setdefa...
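The "json request to graphite" suggestion at 21:45 maps to Graphite's render API with `format=json`. A small sketch of building such a query and reading the response; the metric path `tools.<tool>.requests` is a hypothetical series name, not a confirmed one, and the host is the graphite-labs instance linked earlier in the log:

```python
# Sketch: per-tool data via the Graphite render API (format=json).
import json
from urllib.parse import urlencode

RENDER = "https://graphite-labs.wikimedia.org/render/"

def render_url(tool: str, hours: int = 24) -> str:
    """Build a render API URL returning JSON datapoints for one tool."""
    params = {
        "target": f"tools.{tool}.requests",  # hypothetical metric path
        "format": "json",
        "from": f"-{hours}h",
    }
    return RENDER + "?" + urlencode(params)

def latest_value(body: str):
    """Return the most recent non-null datapoint from a render response.

    The render API returns a list of series, each with "datapoints" as
    [value, timestamp] pairs; trailing values may be null.
    """
    series = json.loads(body)
    if not series:
        return None
    for value, _ts in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None

# Example against a canned response rather than a live request:
sample = '[{"target": "t", "datapoints": [[12.0, 100], [7.0, 160], [null, 220]]}]'
print(latest_value(sample))  # → 7.0
```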
[22:38:54] 10Striker, 07Epic: Manage shared tool accounts via Striker - https://phabricator.wikimedia.org/T149458#2753308 (10bd808) [22:40:28] abbe98: there isn't really much we as labs admins could do, sorry! [22:43:28] YuviPanda: Okey thanks anyway! [23:02:51] 10Striker, 07Epic, 05Goal: Manage shared tool accounts via Striker - https://phabricator.wikimedia.org/T149458#2753346 (10bd808) [23:03:19] 10Striker, 06Community-Tech-Tool-Labs, 05Goal, 13Patch-For-Review, 15User-bd808: Create Wikitech/LDAP accounts via a new user friendly guided workflow - https://phabricator.wikimedia.org/T144710#2753347 (10bd808) [23:48:47] yuvipanda: there's an interesting question from tobias47n9e-c in https://phabricator.wikimedia.org/T149191#2753271 [23:49:07] he's working on getting a django app running on py3 k8s [23:49:13] bd808: ah, you can override uwsgi settings [23:49:30] by putting an ini file in ~/uwsgi.ini [23:49:34] the static file path is the interesting part [23:49:41] bd808: uwsgi has the ability to serve static files [23:49:47] ah. nice [23:49:49] http://uwsgi-docs.readthedocs.io/en/latest/StaticFiles.html [23:50:49] uwsgi has a ton of features, and you can use 'em all [23:51:03] the app.py is also not a strict requirement, you can override it [23:51:10] you can basically follow any 'uwsgi + django' tutorial [23:54:21] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 06Community-Tech-Tool-Labs: My first kubernetes + python3 + django app tutorial - https://phabricator.wikimedia.org/T149191#2753468 (10bd808) @Tobias1984 To actually serve your static content check out https://uwsgi-docs.readthedocs.io/en/latest/StaticFiles.html and... [23:55:34] bd808: the code will also need to know what the base url is - from looking at the way it was trying to hit /static, I think it thinks it's running on base [23:55:35] err [23:55:36] on / [23:56:27] yeah. there's a Django setting to tell it where your static files are really mounted. 
that should be trivial once he gets them being served [23:58:41] yeah [23:58:57] bd808: if you put static stuff in ~/www/static, they 'll also get served from tools-static.wmflabs.org/$toolname [23:59:12] but that's different origin, which might (or might not!) have implications in your code [23:59:21] it's technically faster (nginx -> NFS) [23:59:33] shouldn't matter unless you are like serving massive files
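The uwsgi static-file approach described above fits in the `~/uwsgi.ini` override mentioned at 23:49. A sketch, assuming the uWSGI `static-map` option per the StaticFiles doc linked in the log; the tool path and mount point are illustrative and must match the app's `STATIC_URL` and collectstatic output:

```ini
; ~/uwsgi.ini -- serve Django static files directly from uwsgi.
[uwsgi]
; map URL prefix /static to the collected static files on disk
static-map = /static=/data/project/mytool/static
; optional: cache headers for everything served via static-map
static-expires = /* 3600
```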