[00:48:09] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[00:54:56] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 200.48 ms
[00:57:00] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[00:59:22] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[01:05:55] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[01:10:51] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[01:11:40] gifti ok, so your webservice works on k8s properly now, right?
[01:11:50] right
[01:12:32] gifti \o/ ok. if you describe the bots you run and how they are run, I'll try to figure out a way to run them on k8s
[01:13:17] what do you need exactly?
[01:16:01] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2556020 (10yuvipanda) @Krenair just told me deployment-redis02 is in a similar state, and I see there has been no changes in the disk io numbers from nova for a while. It's a trusty ins...
[04:22:22] (03PS1) 10BryanDavis: Add ini setting for cannonical ssl hostname [labs/striker] - 10https://gerrit.wikimedia.org/r/304965
[04:23:22] (03CR) 10BryanDavis: [C: 032] Add ini setting for cannonical ssl hostname [labs/striker] - 10https://gerrit.wikimedia.org/r/304965 (owner: 10BryanDavis)
[04:24:18] (03Merged) 10jenkins-bot: Add ini setting for cannonical ssl hostname [labs/striker] - 10https://gerrit.wikimedia.org/r/304965 (owner: 10BryanDavis)
[04:24:54] (03PS1) 10BryanDavis: Bump Striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/304966
[04:25:08] (03CR) 10BryanDavis: [C: 032] Bump Striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/304966 (owner: 10BryanDavis)
[04:25:14] (03Merged) 10jenkins-bot: Bump Striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/304966 (owner: 10BryanDavis)
[06:42:40] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Delete ci-trusty-wikimedia-278848 instance in contintcloud project - https://phabricator.wikimedia.org/T143058#2556217 (10Paladox) Oh sorry..
[06:59:19] PROBLEM - Puppet run on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[07:34:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:13:52] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[09:50:18] 10Tool-Labs-tools-Pageviews: Pageviews: Fix topviews - https://phabricator.wikimedia.org/T143026#2556517 (10Aklapper)
[09:53:49] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:10:58] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2557175 (10chasemp)
[14:28:14] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2557202 (10chasemp)
[14:29:36] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: OpenStack misreports number of instances per project - https://phabricator.wikimedia.org/T143018#2557217 (10chasemp) 05Open>03declined Some parts of this are confused by the adhoc nature of reporting on our end, the usage command is i...
[14:32:36] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2557226 (10chasemp) p:05Triage>03High
[16:12:31] yuvipanda, bd808 I wonder why this patch has +2 but is not merged? https://gerrit.wikimedia.org/r/#/c/304889/
[16:13:50] oh.
its parent isn't merged yet
[16:14:00] ah
[16:14:17] * bd808 fixes that
[16:14:32] bd808: thx
[16:15:27] The method of showing dependencies in the new gerrit ui isn't very intuitive
[16:15:37] "related changes" doesn
[16:15:39] doesn
[16:15:47] doesn't scream "parent"
[16:15:55] :O
[16:15:59] yep, I don't like that either
[16:49:03] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[16:49:48] 06Labs, 10Labs-Infrastructure: Clean up leaked designate entries - https://phabricator.wikimedia.org/T120797#2557644 (10AlexMonk-WMF) I tried to make something similar for LDAP but was defeated by our ldap server's query results size limit
[17:21:35] RECOVERY - Puppet staleness on tools-k8s-master-02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[17:22:18] RECOVERY - Puppet run on tools-k8s-master-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:24:07] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:33:36] Anyone know why I'm getting a "Lost connection to MySQL server during query" error while running a SQL query?
[17:46:08] Betacommand: is it a very long-running query? I think there is a time limit killer at some point
[17:47:41] It was adjusted to 30 minutes recently I think.
[17:49:06] damn, any way to run longer queries?
[17:50:38] Not that I know of. Sometimes you can figure out how to batch things or otherwise optimize the query.
[17:52:30] not possible with this query though
[17:53:26] I think it's at 1h
[17:53:36] basically gotta wait for new machines to come online
[17:54:06] OK, do we have any guesses on the ETA?
[17:56:22] not yet unfortunately
[17:56:45] betacommand also, Lost connection could also just be the mysql connection timeout
[17:56:50] if you have an idle connection open for too long
[17:56:53] it'll run into that
[17:57:03] mysql on Python has a .ping method to avoid this
[17:57:14] so if you're running into it intermittently, it's probably that
[17:57:46] yuvipanda: was in the middle of sql enwiki_p $HOME/public_html/reports/mostredlinked.txt
[17:58:23] is that sql file one query or multiple queries?
[17:58:29] one
[17:58:48] has subqueries but that shouldn't matter
[17:59:01] just makes running it longer
[17:59:03] ok. I guess for now all I can offer is try to do fewer single large queries and batch them together.
[17:59:30] if you're using SQL to generate output in human-readable form (wikitext, html, etc.) try to not do that and use a programming language instead (I don't know if this script is doing that but I know there are several others on tools that do)
[17:59:31] PROBLEM - Puppet run on tools-redis-1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[17:59:49] other than that, nothing I can offer right now. sorry
[18:00:56] yuvipanda: the output part is minimal, just a monster of a query. Used to run it on the TS and was looking to revive it.
[18:01:58] try it without it? and otherwise, we'll email labs-l / labs-announce when the new boxes are online
[18:06:30] yuvipanda: thanks for the suggestions, are we talking a few weeks, months or 6+?
[18:07:11] betacommand realistically, I'd say a month at least, 1.5-2 months a solid guarantee.
[18:07:23] not 6+ months
[18:08:03] I'm surprised, I thought we already had the boxes.
[18:08:14] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:08:16] we do
[18:08:29] however, we're blocked on time mostly
[18:08:31] we're also trying to replace the NFS boxes first
[18:08:37] and work on that has been ongoing for a while
[18:08:40] so that'll happen first
[18:08:50] Ah
[18:09:03] plus we're still at only one DBA and he has to take care of prod too
[18:09:04] so combination of factors
[18:10:56] (03PS1) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[18:12:04] PROBLEM - Puppet run on tools-redis-1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:14:21] 06Labs, 10Labs-Infrastructure: Dump instance info as a static file updated periodically - https://phabricator.wikimedia.org/T143136#2557989 (10yuvipanda)
[18:18:29] 06Labs, 10Labs-Infrastructure: Don't rely on wikitech API for production services - https://phabricator.wikimedia.org/T104575#2558013 (10yuvipanda)
[18:18:31] 06Labs, 10Labs-Infrastructure: Dump instance info as a static file updated periodically - https://phabricator.wikimedia.org/T143136#2558012 (10yuvipanda)
[18:47:04] RECOVERY - Puppet run on tools-redis-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:04:29] RECOVERY - Puppet run on tools-redis-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:10:09] yuvipanda: "if you describe the bots you run and how they are run, I'll try to figure out a way to run them on k8s" what do you need exactly?
[19:13:50] gifti hmm, so a list of bots you run, which account they are running from, whether they are launched via cron or continuously, exact command line used to launch them?
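[Editorial aside: the `.ping` keep-alive mentioned above could look roughly like this in Python. This is a sketch, not code from the channel: `execute_with_keepalive` is a made-up helper name, and it assumes a pymysql-style connection object, whose `ping(reconnect=True)` transparently reconnects if the server dropped the idle connection. It helps only with idle-timeout disconnects, not with queries killed for running too long.]

```python
def execute_with_keepalive(conn, query, args=None):
    """Run a query after reviving a possibly-dropped connection.

    Sketch assuming a pymysql-style connection: ping(reconnect=True)
    reconnects if the server closed the idle connection, which avoids
    intermittent "Lost connection to MySQL server during query" errors
    caused by idle timeouts between queries.
    """
    conn.ping(reconnect=True)
    with conn.cursor() as cur:
        cur.execute(query, args)
        return cur.fetchall()

# Hypothetical usage with a real connection (host/db names illustrative):
# conn = pymysql.connect(host="enwiki.labsdb", db="enwiki_p", read_default_file="~/replica.my.cnf")
# rows = execute_with_keepalive(conn, "SELECT 1")
```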
[19:14:32] (03PS2) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[19:16:38] yuvipanda: SUL-account: GiftBot (mainly dewiki, but also other German language projects), tools.giftbot, there is a crontab and there are 5 continuous jobs, one of them java, some of the crontab jobs are php; the cont jobs have aliases in .bashrc: gva, vm, mg, sga, gvm
[19:17:26] that should give you everything you asked for
[19:17:30] That'd be fairly easy to port to k8s.
[19:18:40] (03PS3) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[19:21:35] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Delete ci-trusty-wikimedia-278848 instance in contintcloud project - https://phabricator.wikimedia.org/T143058#2558277 (10chasemp) 05Open>03Resolved a:03chasemp I wish I knew why that was the case but I deleted from the CLI and I s...
[19:22:42] (03PS4) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[19:34:29] gifti does that include the jobs that run on the dedicated node?
[19:34:37] yes
[19:35:04] they are in jlocal dwl?.sh
[19:35:23] I'm going to copy this all down to https://phabricator.wikimedia.org/T99130
[19:35:32] the continuous jobs seem like the easiest to move first
[19:37:16] yuvipanda: why does it have an exclamation mark?
[19:37:20] PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:37:39] 06Labs, 10Tool-Labs: Investigate alternatives to dedicated exec node for gifti's tools - https://phabricator.wikimedia.org/T99130#2558358 (10yuvipanda) Copying from IRC: > gifti: yuvipanda: SUL-account: GiftBot (mainly dewiki, but also other German language projects), tools.giftbot, there is a crontab and th...
[19:37:52] gifti why does what have an exclamation mark?
[19:38:06] in the ticket before the title
[19:38:15] 06Labs, 10Tool-Labs: Investigate alternatives to dedicated exec node for gifti's tools - https://phabricator.wikimedia.org/T99130#2558359 (10yuvipanda) The continuous jobs seem like the easiest to move first.
[19:38:22] with a yellowish color
[19:38:38] oh
[19:38:41] that just means 'priority = normal' I think
[19:38:49] ah, ok
[20:39:16] (03PS5) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[20:46:43] (03PS6) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[20:47:35] after I create a crontab, do I have to jsub/jstart the task?
[20:48:17] mafk: no, cron starts the task at the requested time
[20:48:43] ah valhallasw`cloud that's great
[20:51:54] (03PS7) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[21:05:49] valhallasw`cloud: crontab -e resulted in "12 04 * * * /usr/bin/jsub -N cron-tools.mabot-1 -once -quiet redirects.sh"
[21:09:49] (03PS8) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[21:11:04] mafk: that looks ok
[21:11:23] valhallasw`cloud: however the file is not on labs
[21:11:30] I can't find any .crontab
[21:11:42] file name is /tmp/
[21:12:59] it's on a different server, but that doesn't matter
[21:13:24] when you save and exit the editor, the crontab is installed
[21:13:29] and active immediately
[21:13:35] and to delete it?
[21:13:45] or it's always the same file name?
[21:13:56] I wonder because I had some issues at first
[21:14:36] delete: crontab -e and save empty file or crontab -r
[21:15:15] done!
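[Editorial aside: pulling the crontab exchange above together — cron itself runs jsub at the scheduled time, so the job needs no manual jsub/jstart. A sketch of such a crontab, using the example line quoted above; the column-header comment is standard cron convention, not anything Tool Labs-specific:]

```
# m   h   dom mon dow   command
12    04  *   *   *     /usr/bin/jsub -N cron-tools.mabot-1 -once -quiet redirects.sh
# A leading '#' disables a line without deleting it. The installed
# crontab lives on the cron server, not in a file in your tool's home;
# 'crontab -r' (or saving an empty file from 'crontab -e') removes it.
```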
[21:15:17] thanks
[21:19:52] mafk: also note you can prepend with # to comment out a line
[21:20:18] Platonides: that's helpful too, thanks
[21:36:36] (03CR) 10Dzahn: [C: 032] Blacklist channel #wikimedia-operations/debs/wikistats [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/304746 (owner: 10Paladox)
[21:44:15] (03PS1) 10BryanDavis: Add connection options for MySQL backend [labs/striker] - 10https://gerrit.wikimedia.org/r/305133
[21:45:49] (03CR) 10BryanDavis: [C: 032] Add connection options for MySQL backend [labs/striker] - 10https://gerrit.wikimedia.org/r/305133 (owner: 10BryanDavis)
[22:00:23] (03Merged) 10jenkins-bot: Add connection options for MySQL backend [labs/striker] - 10https://gerrit.wikimedia.org/r/305133 (owner: 10BryanDavis)
[22:20:23] (03PS9) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[22:20:25] (03PS1) 10BryanDavis: Bump Striker [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305138
[22:23:00] (03CR) 10BryanDavis: [C: 032] Bump Striker [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305138 (owner: 10BryanDavis)
[22:30:49] (03Merged) 10jenkins-bot: Bump Striker [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305138 (owner: 10BryanDavis)
[22:31:56] could I get someone to double-check the allocation for the contintcloud project? Watching nodepool I keep seeing 403s in spite of consistently having 6 images (rather than 10) in the project. Looking at the logs, nodepool is spending a lot of time calculating, flailing, and recalculating.
[22:32:33] I know we're causing lots of error responses in openstack, trying to get to the bottom of it.
[22:39:28] thcipriani: do you have any instances stuck in deleting?
[22:39:43] I saw that on Sunday night
[22:40:09] bd808: not that I saw
[22:40:24] I dropped the number of instances to 6 just now, seeing a lot fewer 403 exceptions in the logs
[22:40:37] well, seeing *no* 403 exceptions in the logs
[22:41:51] it seems like nodepool was getting backed up because it was doing a calculation of the servers it had vs. could have vs. needs, attempted to launch a bunch, failed to launch that many servers for some reason, and went back to square 1
[22:42:22] hmm.. I see 7 images
[22:42:40] don't think that should be an issue though
[22:43:33] you have a glance client, right?
[22:43:34] Krenair: where are you seeing 7?
[22:43:53] I've just been watching with: nodepool list and openstack server list
[22:44:32] are we talking images or instances?
[22:44:53] ah, instances, sorry, missed that in your message.
[22:45:27] nova says you have 4 instances right now
[22:46:32] 339163, 339154, 339160 and 339161
[22:46:51] yeah, doesn't look like there's too much demand currently (looking at zuul dashboard)
[22:47:32] importantly the crazy amount of debug log exceptions has gone down(!)
[23:01:14] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2559382 (10thcipriani) I dropped the `max-servers` in `/etc/nodepool/nodepool.yaml` to 6 as that seemed to be the max number of allocated ins...
[23:03:07] thcipriani, contintcloud has an instances quota of 10
[23:03:26] 40 cores, 100GB RAM
[23:03:36] no floating ips
[23:03:55] hrm. I wonder if it's hitting some other quota.
[23:04:03] rather than number of instances
[23:04:20] I think the nova api would return a response other than 403?
[23:04:49] I meant like for ram or cores rather than instance number
[23:05:58] hrm, no, that'd be unlikely with m1.medium and that kind of allocation.
[23:06:12] 6 m1.mediums
[23:08:38] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2559429 (10thcipriani) hrm. Maybe I stopped seeing 403s since demand was lower. Still working with 6 instances, just got: ``` Forbidden: Quo...
[23:09:08] 10 m1.mediums would be well within your quotas
[23:12:54] thcipriani, you may have to wait for and.rew to return I'm afraid
[23:13:52] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2559446 (10thcipriani) Messages like this one: ``` DEBUG nodepool.NodePool: Deleting node id: 261807 which has been in building state for 0....
[23:14:40] Krenair: yarp, mayhaps. Thanks for double-checking the quotas for me :)
[23:29:48] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Striker, 13Patch-For-Review: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2559490 (10bd808)
[23:55:25] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Striker, and 2 others: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2559553 (10bd808) a:03bd808