[00:48:09] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[00:54:56] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 200.48 ms
[00:57:00] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[00:59:22] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[01:05:55] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[01:10:51] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[01:11:40] gifti ok, so your webservice works on k8s properly now, right?
[01:11:50] right
[01:12:32] gifti \o/ ok. if you describe the bots you run and how they are run, I'll try to figure out a way to run them on k8s
[01:13:17] what do you need exactly?
[01:16:01] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2556020 (10yuvipanda) @Krenair just told me deployment-redis02 is in a similar state, and I see there has been no changes in the disk io numbers from nova for a while. It's a trusty ins...
[04:22:22] (03PS1) 10BryanDavis: Add ini setting for cannonical ssl hostname [labs/striker] - 10https://gerrit.wikimedia.org/r/304965
[04:23:22] (03CR) 10BryanDavis: [C: 032] Add ini setting for cannonical ssl hostname [labs/striker] - 10https://gerrit.wikimedia.org/r/304965 (owner: 10BryanDavis)
[04:24:18] (03Merged) 10jenkins-bot: Add ini setting for cannonical ssl hostname [labs/striker] - 10https://gerrit.wikimedia.org/r/304965 (owner: 10BryanDavis)
[04:24:54] (03PS1) 10BryanDavis: Bump Striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/304966
[04:25:08] (03CR) 10BryanDavis: [C: 032] Bump Striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/304966 (owner: 10BryanDavis)
[04:25:14] (03Merged) 10jenkins-bot: Bump Striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/304966 (owner: 10BryanDavis)
[06:42:40] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Delete ci-trusty-wikimedia-278848 instance in contintcloud project - https://phabricator.wikimedia.org/T143058#2556217 (10Paladox) Oh sorry..
[06:59:19] PROBLEM - Puppet run on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[07:34:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:13:52] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[09:50:18] 10Tool-Labs-tools-Pageviews: Pageviews: Fix topviews - https://phabricator.wikimedia.org/T143026#2556517 (10Aklapper)
[09:53:49] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:10:58] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2557175 (10chasemp)
[14:28:14] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2557202 (10chasemp)
[14:29:36] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: OpenStack misreports number of instances per project - https://phabricator.wikimedia.org/T143018#2557217 (10chasemp) 05Open>03declined Some parts of this are confused by the adhoc nature of reporting on our end, the usage command is i...
[14:32:36] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2557226 (10chasemp) p:05Triage>03High
[16:12:31] yuvipanda, bd808 I wonder why this patch has +2 but is not merged? https://gerrit.wikimedia.org/r/#/c/304889/
[16:13:50] oh.
its parent isn't merged yet
[16:14:00] ah
[16:14:17] * bd808 fixes that
[16:14:32] bd808: thx
[16:15:27] The method of showing dependencies in the new gerrit ui isn't very intuitive
[16:15:37] "related changes" doesn
[16:15:39] doesn
[16:15:47] doesn't scream "parent"
[16:15:55] :O
[16:15:59] yep, I don't like that either
[16:49:03] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[16:49:48] 06Labs, 10Labs-Infrastructure: Clean up leaked designate entries - https://phabricator.wikimedia.org/T120797#2557644 (10AlexMonk-WMF) I tried to make something similar for LDAP but was defeated by our ldap server's query results size limit
[17:21:35] RECOVERY - Puppet staleness on tools-k8s-master-02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[17:22:18] RECOVERY - Puppet run on tools-k8s-master-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:24:07] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:33:36] Anyone know why I'm getting a "Lost connection to MySQL server during query" error while running a SQL query?
[17:46:08] Betacommand: is it a very long-running query? I think there is a time limit killer at some point
[17:47:41] It was adjusted to 30 minutes recently I think.
[17:49:06] damn, any way to run longer queries?
[17:50:38] Not that I know of. Sometimes you can figure out how to batch things or otherwise optimize the query.
[17:52:30] not possible with this query though
[17:53:26] I think it's at 1h
[17:53:36] basically gotta wait for new machines to come online
[17:54:06] OK, do we have any guesses on the ETA?
[17:56:22] not yet unfortunately
[17:56:45] betacommand also, Lost connection could also just be the mysql connection timeout
[17:56:50] if you have an idle connection open for too long
[17:56:53] it'll run into that
[17:57:03] mysql on Python has a .ping method to avoid this
[17:57:14] so if you're running into it intermittently, it's probably that
[17:57:46] yuvipanda: was in the middle of sql enwiki_p $HOME/public_html/reports/mostredlinked.txt
[17:58:23] is that sql file one query or multiple queries?
[17:58:29] one
[17:58:48] has subqueries but that shouldn't matter
[17:59:01] just makes running it longer
[17:59:03] ok. I guess for now all I can offer is try to do fewer single large queries and batch them together.
[17:59:30] if you're using SQL to generate output in human-readable form (wikitext, html, etc.) try to not do that and use a programming language instead (I don't know if this script is doing that but I know there are several others on tools that do)
[17:59:31] PROBLEM - Puppet run on tools-redis-1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[17:59:49] other than that, nothing I can offer right now. sorry
[18:00:56] yuvipanda: the output part is minimal, just a monster of a query. Used to run it on the TS and was looking to revive it.
[18:01:58] try it without it? and otherwise, we'll email labs-l / labs-announce when the new boxes are online
[18:06:30] yuvipanda: thanks for the suggestions, are we talking a few weeks, months or 6+?
[18:07:11] betacommand realistically, I'd say a month at least, 1.5-2 months a solid guarantee.
[18:07:23] not 6+ months
[18:08:03] I'm surprised, I thought we already had the boxes.
[18:08:14] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:08:16] we do
[18:08:29] however, we're blocked on time mostly
[18:08:31] we're also trying to replace the NFS boxes first
[18:08:37] and work on that has been ongoing for a while
[18:08:40] so that'll happen first
[18:08:50] Ah
[18:09:03] plus we're still at only one DBA and he has to take care of prod too
[18:09:04] so combination of factors
[18:10:56] (03PS1) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[18:12:04] PROBLEM - Puppet run on tools-redis-1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:14:21] 06Labs, 10Labs-Infrastructure: Dump instance info as a static file updated periodically - https://phabricator.wikimedia.org/T143136#2557989 (10yuvipanda)
[18:18:29] 06Labs, 10Labs-Infrastructure: Don't rely on wikitech API for production services - https://phabricator.wikimedia.org/T104575#2558013 (10yuvipanda)
[18:18:31] 06Labs, 10Labs-Infrastructure: Dump instance info as a static file updated periodically - https://phabricator.wikimedia.org/T143136#2558012 (10yuvipanda)
[18:47:04] RECOVERY - Puppet run on tools-redis-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:04:29] RECOVERY - Puppet run on tools-redis-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:10:09] yuvipanda: "if you describe the bots you run and how they are run, I'll try to figure out a way to run them on k8s" what do you need exactly?
[19:13:50] gifti hmm, so a list of bots you run, which account they are running from, whether they are launched via cron or continuously, exact command line used to launch them?
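[Editorial aside: the `.ping` keep-alive mentioned above could look roughly like this in Python. This is a sketch, not code from the channel: `execute_with_keepalive` is a made-up helper name, and it assumes a pymysql-style connection object, whose `ping(reconnect=True)` transparently reconnects if the server dropped the idle connection. It helps only with idle-timeout disconnects, not with queries killed for running too long.]

```python
def execute_with_keepalive(conn, query, args=None):
    """Run a query after reviving a possibly-dropped connection.

    Sketch assuming a pymysql-style connection: ping(reconnect=True)
    reconnects if the server closed the idle connection, which avoids
    intermittent "Lost connection to MySQL server during query" errors
    caused by idle timeouts between queries.
    """
    conn.ping(reconnect=True)
    with conn.cursor() as cur:
        cur.execute(query, args)
        return cur.fetchall()

# Hypothetical usage with a real connection (host/db names illustrative):
# conn = pymysql.connect(host="enwiki.labsdb", db="enwiki_p", read_default_file="~/replica.my.cnf")
# rows = execute_with_keepalive(conn, "SELECT 1")
```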
[19:14:32] (03PS2) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[19:16:38] yuvipanda: SUL-account: GiftBot (mainly dewiki, but also other German language projects), tools.giftbot, there is a crontab and there are 5 continuous jobs, one of them java, some of the crontab jobs are php; the cont jobs have aliases in .bashrc: gva, vm, mg, sga, gvm
[19:17:26] that should give you everything you asked for
[19:17:30] That'd be fairly easy to port to k8s.
[19:18:40] (03PS3) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[19:21:35] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Delete ci-trusty-wikimedia-278848 instance in contintcloud project - https://phabricator.wikimedia.org/T143058#2558277 (10chasemp) 05Open>03Resolved a:03chasemp I wish I knew why that was the case but I deleted from the CLI and I s...
[19:22:42] (03PS4) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[19:34:29] gifti does that include the jobs that run on the dedicated node?
[19:34:37] yes
[19:35:04] they are in jlocal dwl?.sh
[19:35:23] I'm going to copy this all down to https://phabricator.wikimedia.org/T99130
[19:35:32] the continuous jobs seem like the easiest to move first
[19:37:16] yuvipanda: why does it have an exclamation mark?
[19:37:20] PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:37:39] 06Labs, 10Tool-Labs: Investigate alternatives to dedicated exec node for gifti's tools - https://phabricator.wikimedia.org/T99130#2558358 (10yuvipanda) Copying from IRC: > gifti: yuvipanda: SUL-account: GiftBot (mainly dewiki, but also other German language projects), tools.giftbot, there is a crontab and th...
[19:37:52] gifti why does what have an exclamation mark?
[19:38:06] in the ticket before the title
[19:38:15] 06Labs, 10Tool-Labs: Investigate alternatives to dedicated exec node for gifti's tools - https://phabricator.wikimedia.org/T99130#2558359 (10yuvipanda) The continuous jobs seem like the easiest to move first.
[19:38:22] with a yellowish color
[19:38:38] oh
[19:38:41] that just means 'priority = normal' I think
[19:38:49] ah, ok
[20:39:16] (03PS5) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[20:46:43] (03PS6) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[20:47:35] after I create a crontab, do I have to jsub/jstart the task?
[20:48:17] mafk: no, cron starts the task at the requested time
[20:48:43] ah valhallasw`cloud that's great
[20:51:54] (03PS7) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[21:05:49] valhallasw`cloud: crontab -e resulted in "12 04 * * * /usr/bin/jsub -N cron-tools.mabot-1 -once -quiet redirects.sh"
[21:09:49] (03PS8) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[21:11:04] mafk: that looks ok
[21:11:23] valhallasw`cloud: however the file is not on labs
[21:11:30] I can't find any .crontab
[21:11:42] file name is /tmp/
[21:12:59] it's on a different server, but that doesn't matter
[21:13:24] when you save and exit the editor, the crontab is installed
[21:13:29] and active immediately
[21:13:35] and to delete it?
[21:13:45] or it's always the same file name?
[21:13:56] I wonder because I had some issues at first
[21:14:36] delete: crontab -e and save empty file or crontab -r
[21:15:15] done!
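[Editorial aside: pulling the crontab exchange above together — cron itself runs jsub at the scheduled time, so the job needs no manual jsub/jstart. A sketch of such a crontab, using the example line quoted above; the column-header comment is standard cron convention, not anything Tool Labs-specific:]

```
# m   h   dom mon dow   command
12    04  *   *   *     /usr/bin/jsub -N cron-tools.mabot-1 -once -quiet redirects.sh
# A leading '#' disables a line without deleting it. The installed
# crontab lives on the cron server, not in a file in your tool's home;
# 'crontab -r' (or saving an empty file from 'crontab -e') removes it.
```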
[21:15:17] thanks
[21:19:52] mafk: also note you can prepend with # to comment out a line
[21:20:18] Platonides: that's helpful too, thanks
[21:36:36] (03CR) 10Dzahn: [C: 032] Blacklist channel #wikimedia-operations/debs/wikistats [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/304746 (owner: 10Paladox)
[21:44:15] (03PS1) 10BryanDavis: Add connection options for MySQL backend [labs/striker] - 10https://gerrit.wikimedia.org/r/305133
[21:45:49] (03CR) 10BryanDavis: [C: 032] Add connection options for MySQL backend [labs/striker] - 10https://gerrit.wikimedia.org/r/305133 (owner: 10BryanDavis)
[22:00:23] (03Merged) 10jenkins-bot: Add connection options for MySQL backend [labs/striker] - 10https://gerrit.wikimedia.org/r/305133 (owner: 10BryanDavis)
[22:20:23] (03PS9) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545)
[22:20:25] (03PS1) 10BryanDavis: Bump Striker [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305138
[22:23:00] (03CR) 10BryanDavis: [C: 032] Bump Striker [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305138 (owner: 10BryanDavis)
[22:30:49] (03Merged) 10jenkins-bot: Bump Striker [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305138 (owner: 10BryanDavis)
[22:31:56] could I get someone to double-check the allocation for the contintcloud project? Watching nodepool I keep seeing 403s in spite of consistently having 6 images (rather than 10) in the project. Looking at the logs, nodepool is spending a lot of time calculating, flailing, and recalculating.
[22:32:33] I know we're causing lots of error responses in openstack, trying to get to the bottom of it.
[22:39:28] thcipriani: do you have any instances stuck in deleting?
[22:39:43] I saw that on Sunday night
[22:40:09] bd808: not that I saw
[22:40:24] I dropped the number of instances to 6 just now, seeing a lot fewer 403 exceptions in the logs
[22:40:37] well, seeing *no* 403 exceptions in the logs
[22:41:51] it seems like nodepool was getting backed up because it was doing a calculation of the servers it had vs. could have vs. needs, attempted to launch a bunch, failed to launch that many servers for some reason, and went back to square 1
[22:42:22] hmm.. I see 7 images
[22:42:40] don't think that should be an issue though
[22:43:33] you have a glance client, right?
[22:43:34] Krenair: where are you seeing 7?
[22:43:53] I've just been watching with: nodepool list and openstack server list
[22:44:32] are we talking images or instances?
[22:44:53] ah, instances, sorry, missed that in your message.
[22:45:27] nova says you have 4 instances right now
[22:46:32] 339163, 339154, 339160 and 339161
[22:46:51] yeah, doesn't look like there's too much demand currently (looking at zuul dashboard)
[22:47:32] importantly the crazy amount of debug log exceptions has gone down(!)
[23:01:14] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2559382 (10thcipriani) I dropped the `max-servers` in `/etc/nodepool/nodepool.yaml` to 6 as that seemed to be the max number of allocated ins...
[23:03:07] thcipriani, contintcloud has an instances quota of 10
[23:03:26] 40 cores, 100GB RAM
[23:03:36] no floating ips
[23:03:55] hrm. I wonder if it's hitting some other quota.
[23:04:03] rather than number of instances
[23:04:20] I think the nova api would return a response other than 403?
[23:04:49] I meant like for ram or cores rather than instance number
[23:05:58] hrm, no, that'd be unlikely with m1.medium and that kind of allocation.
[23:06:12] 6 m1.mediums
[23:08:38] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2559429 (10thcipriani) hrm. Maybe I stopped seeing 403s since demand was lower. Still working with 6 instances, just got: ``` Forbidden: Quo...
[23:09:08] 10 m1.mediums would be well within your quotas
[23:12:54] thcipriani, you may have to wait for and.rew to return I'm afraid
[23:13:52] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2559446 (10thcipriani) Messages like this one: ``` DEBUG nodepool.NodePool: Deleting node id: 261807 which has been in building state for 0....
[23:14:40] Krenair: yarp, mayhaps. Thanks for double-checking the quotas for me :)
[23:29:48] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Striker, 13Patch-For-Review: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2559490 (10bd808)
[23:55:25] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Striker, and 2 others: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2559553 (10bd808) a:03bd808