[00:03:14] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:39:30] Hey
[02:46:46] 06Labs, 10Labs-Infrastructure: Plan deprecation of all precise instances in Labs - https://phabricator.wikimedia.org/T143349#2574208 (10yuvipanda) i filtered out things in ERROR or SHUTOFF state. Not sure how I missed 'precise' tho.
[03:14:15] RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1207 is OK: OK: Less than 1.00% above the threshold [3600.0]
[03:14:23] RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [3600.0]
[04:26:46] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[05:01:45] RECOVERY - Puppet run on tools-worker-1010 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:47:04] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL - No data received from host
[05:47:36] hmm
[05:48:37] !log tools restarted nginx on tools-proxy-01, was out of connection slots
[05:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[05:52:05] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.030 second response time
[06:01:49] 06Labs, 10Tool-Labs: Tune nginx config parameters for tools / labs proxies - https://phabricator.wikimedia.org/T143637#2574397 (10yuvipanda)
[06:07:00] 06Labs, 10Tool-Labs: Setup a simple service that pages when it is unreachable - https://phabricator.wikimedia.org/T143638#2574412 (10yuvipanda)
[06:09:11] 06Labs, 10Labs-Infrastructure, 10Tool-Labs: Write a simple script that handles failovering proxies - https://phabricator.wikimedia.org/T143639#2574425 (10yuvipanda)
[06:20:28] 06Labs, 10Tool-Labs, 10crosswatch: Crosswatch sends out large amounts of error mails, crashing tools-mail - https://phabricator.wikimedia.org/T143476#2574457 (10yuvipanda) a:03Sitic
[06:21:43] 06Labs, 10Tool-Labs: Python is different versions on Tool labs and Grit. How run a scripts on grid and with scheduling? - https://phabricator.wikimedia.org/T143473#2574459 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Yup! Run your jobs with `-l release=trusty` and that should work. We apologize for this...
[06:22:03] 06Labs, 10Tool-Labs, 13Patch-For-Review: webservice generic: unrecognized arguments: --extra-args - https://phabricator.wikimedia.org/T143403#2574463 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Should be fixed now.
[06:22:47] 06Labs, 10Tool-Labs, 13Patch-For-Review: redis diamond collector not running/reporting on tools-redis-1001/2 - https://phabricator.wikimedia.org/T142735#2544920 (10yuvipanda) @valhallasw is this ok now? Can this be closed?
[06:23:00] 06Labs, 10Tool-Labs, 13Patch-For-Review: Move all of tool labs to project puppetmaster - https://phabricator.wikimedia.org/T142452#2574468 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Done!
[06:24:14] 06Labs, 10Tool-Labs, 10grrrit-wm: Update npm to 2.x on tools - https://phabricator.wikimedia.org/T142253#2574471 (10yuvipanda) 05Open>03declined I'm going to decline this right now for tools overall, since we don't really want to install things outside of debian packaged setup. If there are specific bugs...
[06:24:20] PROBLEM - Puppet run on tools-worker-1008 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[06:24:44] 06Labs, 10Tool-Labs, 07Tracking: Tool Labs users missing replica.my.cnf (tracking) - https://phabricator.wikimedia.org/T135931#2574477 (10yuvipanda)
[06:24:46] 06Labs, 10Tool-Labs: tools.osmlint missing replica.my.cnf - https://phabricator.wikimedia.org/T142130#2574474 (10yuvipanda) 05Open>03Resolved a:03yuvipanda I see it there now.
[06:25:06] 06Labs, 10Tool-Labs: tools-docker-builder-03 hanging - https://phabricator.wikimedia.org/T141720#2574479 (10yuvipanda) 05Open>03Invalid Destroyed and recreated, and the original hang was documented elsewhere.
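The 05:48 restart above ("out of connection slots") is what prompted T143637. The log does not record which parameters were eventually tuned, so the following is only a hypothetical sketch of the kind of nginx change such a ticket usually means: raising the per-worker connection budget and the matching file-descriptor limit. The values shown are illustrative, not taken from the actual proxy config.

```nginx
# Hypothetical tuning sketch for a busy shared proxy (values illustrative,
# not from tools-proxy-01's real config):
worker_processes auto;
worker_rlimit_nofile 65536;      # fd limit per worker; must exceed worker_connections

events {
    # The stock default (768/1024) is easy to exhaust when one proxy
    # fronts many webservices; each proxied request holds two connections.
    worker_connections 16384;
}
```

A restart only clears the symptom; raising these limits (and alerting on connection usage) addresses the cause.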
[06:25:23] 06Labs, 10Labs-project-other, 10Tool-Labs, 06WMDE-Analytics-Engineering, 15User-Addshore: Add simple-json-datasource plugin to labs grafana instance - https://phabricator.wikimedia.org/T141636#2574481 (10yuvipanda) @addshore let's set a time up to merge / test this?
[06:26:00] 06Labs, 10Tool-Labs: Do not truncate my precious data - https://phabricator.wikimedia.org/T141331#2574483 (10yuvipanda) 05Open>03Resolved a:03yuvipanda The new NFS setup that @chasemp and @madhuvishy are working on do not truncate any data, I believe.
[06:26:20] 06Labs, 10Tool-Labs: Created tool does not show up, re-creation impossible - https://phabricator.wikimedia.org/T141178#2574487 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Fixed
[06:26:49] 06Labs, 10Tool-Labs, 13Patch-For-Review: redis diamond collector not running/reporting on tools-redis-1001/2 - https://phabricator.wikimedia.org/T142735#2574492 (10valhallasw) 05Open>03Resolved a:03valhallasw Yep!
[06:27:26] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Enable HTTP based service checks for k8s webservices - https://phabricator.wikimedia.org/T139157#2574495 (10yuvipanda) 05Open>03declined I'm going to not do this unless someone specifically asks for this with a solid use case.
[06:28:50] 06Labs, 10Tool-Labs: LDAP entry for group project-tools refers to unknown user account gen-nfs - https://phabricator.wikimedia.org/T138181#2574500 (10yuvipanda) 05Open>03Resolved a:03yuvipanda I've removed it from LDAP.
[06:29:53] 06Labs, 10Tool-Labs: jsub appears to act differently towards network requests - https://phabricator.wikimedia.org/T136588#2574507 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Indeed, adding `-l release=trusty` should run your code on trusty, resolving this issue.
[06:30:56] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 10grrrit-wm: Fix grrrit-wm access situation - https://phabricator.wikimedia.org/T132828#2574510 (10yuvipanda) 05Open>03Resolved a:03yuvipanda This is completely fixed just now. @Krenair the other pod is just the webservice, safe to disregard. I can also just s...
[06:31:29] 06Labs, 10Tool-Labs: Investigate kubelet container garbage collection - https://phabricator.wikimedia.org/T129730#2574513 (10yuvipanda) 05Open>03Resolved a:03yuvipanda This seems to be working as it should...
[06:33:17] 06Labs, 10Tool-Labs, 06Collaboration-Team-Triage, 06Community-Tech-Tool-Labs, and 2 others: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2054159 (10yuvipanda) *bump* on this?
[06:41:38] 06Labs, 10Tool-Labs, 07Upstream: Unable to explain queries on replicated databases - https://phabricator.wikimedia.org/T50875#2574537 (10yuvipanda) a:05coren>03None
[06:51:47] 06Labs, 10Labs-Infrastructure, 06WMF-Legal, 07Privacy: Whitelist labs instances that need XFF header passed through the web proxy - https://phabricator.wikimedia.org/T135046#2574543 (10yuvipanda) We can easily just whitelist utrs and account-creation-assistance, and blacklist all others. Need to announce t...
[06:53:46] 06Labs, 10Labs-Infrastructure, 10Labs-Sprint-102, 06DC-Ops, and 2 others: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#2574544 (10yuvipanda) Did this happen? Does this still need to happen with the new labstore stuff that @chasemp / @m...
[06:55:17] 06Labs, 10Labs-Infrastructure: Add nagios check to ensure global nfs shares are shared properly from labstore1-4 - https://phabricator.wikimedia.org/T45309#2574548 (10yuvipanda) 05Open>03Invalid
[06:56:47] 06Labs, 10Labs-Infrastructure: puppet::self broken - https://phabricator.wikimedia.org/T128930#2574551 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Think not? Also a restart of puppetmaster probably fixed it... Re-open if it is still an issue?
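The resolutions for T143473 (06:21) and T136588 (06:29) both hinge on the same fix: pinning the grid job to Trusty exec nodes with `-l release=trusty`. `jsub` only exists on the Tool Labs bastions, so this sketch just assembles the suggested command line; the job name `myjob` and script `myscript.py` are hypothetical placeholders.

```shell
# Sketch of the submission recommended in the tickets above. The "-l
# release=trusty" resource request tells gridengine to schedule the job
# only on Trusty nodes (newer Python than the Precise nodes).
# "myjob" / "myscript.py" are placeholders, not real tool names.
SUBMIT="jsub -l release=trusty -N myjob python3 myscript.py"
echo "$SUBMIT"
```

On a bastion you would run the assembled command directly; without the `-l release=trusty` pin the job can land on either OS release, which is exactly the version mismatch the tickets describe.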
[06:59:17] RECOVERY - Puppet run on tools-worker-1008 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:04:37] 06Labs, 10Labs-Infrastructure: puppet::self broken - https://phabricator.wikimedia.org/T128930#2090480 (10MoritzMuehlenhoff) I ran into the same problem yesterday when I setup a self-hosted puppetmaster. A restart fixed it, but since that's also documented like that on https://wikitech.wikimedia.org/wiki/Help:...
[07:08:51] !log tools Enabled puppet across tools after merging https://gerrit.wikimedia.org/r/#/c/305657/ (see T134896)
[07:08:52] T134896: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896
[07:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[08:05:12] PROBLEM - Puppet staleness on tools-exec-1204 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[09:48:20] PROBLEM - Puppet staleness on tools-exec-cyberbot is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [43200.0]
[09:53:46] PROBLEM - Puppet staleness on tools-exec-1207 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0]
[09:56:41] PROBLEM - Puppet staleness on tools-exec-1404 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [43200.0]
[09:57:57] PROBLEM - Puppet staleness on tools-exec-1405 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0]
[09:58:05] PROBLEM - Puppet staleness on tools-exec-1205 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0]
[09:58:59] PROBLEM - Puppet staleness on tools-exec-1203 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0]
[10:01:59] PROBLEM - Puppet staleness on tools-exec-1402 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [43200.0]
[10:03:17] PROBLEM - Puppet staleness on tools-exec-1403 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0]
[10:04:45] PROBLEM - Puppet staleness on tools-exec-1202 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0]
[10:07:42] PROBLEM - Puppet staleness on tools-exec-1210 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0]
[10:07:56] PROBLEM - Puppet staleness on tools-exec-1206 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0]
[10:08:16] PROBLEM - Puppet staleness on tools-exec-1401 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [43200.0]
[10:12:58] PROBLEM - Puppet staleness on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [43200.0]
[10:13:16] PROBLEM - Puppet staleness on tools-exec-gift is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [43200.0]
[10:13:32] PROBLEM - Puppet staleness on tools-exec-1406 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [43200.0]
[10:13:34] PROBLEM - Puppet staleness on tools-exec-1201 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0]
[10:14:34] PROBLEM - Puppet staleness on tools-services-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0]
[10:16:45] PROBLEM - Puppet staleness on tools-bastion-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0]
[10:18:53] PROBLEM - Puppet staleness on tools-exec-1208 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [43200.0]
[10:34:26] PROBLEM - Puppet run on tools-flannel-etcd-03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[11:14:26] RECOVERY - Puppet run on tools-flannel-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:40:56] PROBLEM - Puppet staleness on tools-exec-1211 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0]
[11:42:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[11:43:40] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[11:44:58] PROBLEM - Puppet staleness on tools-exec-1213 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0]
[11:47:42] PROBLEM - Puppet staleness on tools-precise-dev is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [43200.0]
[12:22:21] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:23:41] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:31:13] 10Tool-Labs-tools-Other, 10Analytics, 06Community-Tech, 10Pageviews-API, and 2 others: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2575202 (10Lea_WMDE)
[13:19:41] 06Labs, 10Tool-Labs: Setup a simple service that pages when it is unreachable - https://phabricator.wikimedia.org/T143638#2574412 (10chasemp) Not that I'm against a canary check, but a situation where we only have a check that spans several layers as a defacto health monitor was exhausting and low value. DNS...
[13:25:08] 06Labs, 10Labs-Infrastructure, 06DC-Ops, 06Operations, and 2 others: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#2575363 (10chasemp)
[13:25:11] 06Labs, 10Labs-Infrastructure, 10Labs-Sprint-102, 06DC-Ops, and 2 others: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#2575360 (10chasemp) 05Open>03Resolved a:03chasemp >>! In T101741#2574544, @yuvipanda wrote: > Did this happen?...
[13:28:36] 06Labs, 10Tool-Labs: Do not truncate my precious data - https://phabricator.wikimedia.org/T141331#2575372 (10chasemp) The newer NFS setup in progress doesn't affect any of this I think. We will still be limited on space. All log cleanup that has been done has been targeted at logs 100MB and greater iirc. Th...
[13:32:57] 06Labs, 10Labs-Infrastructure: Plan deprecation of all precise instances in Labs - https://phabricator.wikimedia.org/T143349#2575391 (10chasemp) @Andrew was kind enough to write a brief script to consolidate pulling the numbers P3876
[13:57:17] 06Labs, 10Tool-Labs: bigbrother hosts missing exec packages - https://phabricator.wikimedia.org/T143458#2568904 (10chasemp) Example of a big brother host? But I vote adding exec_environ as the simple solution
[14:06:35] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[14:09:58] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:11:14] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[14:17:03] PROBLEM - Puppet run on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[14:18:47] PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:25:01] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:31:14] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:33:06] RECOVERY - Puppet staleness on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:33:46] RECOVERY - Puppet staleness on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:36:40] RECOVERY - Puppet staleness on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:37:56] RECOVERY - Puppet staleness on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:38:45] RECOVERY - Puppet run on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:38:59] RECOVERY - Puppet staleness on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:41:30] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Investigate upgrade of OpenStack python module for labnodepool1001 - https://phabricator.wikimedia.org/T143013#2575765 (10hashar)
[14:41:35] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:41:57] RECOVERY - Puppet staleness on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:42:05] RECOVERY - Puppet run on tools-webgrid-generic-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:43:06] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Investigate upgrade of OpenStack python module for labnodepool1001 - https://phabricator.wikimedia.org/T143013#2554094 (10hashar) I am merging this one in the other task T137217 they are close dupes
[14:43:17] RECOVERY - Puppet staleness on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:43:17] RECOVERY - Puppet staleness on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:43:33] RECOVERY - Puppet staleness on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:44:32] 06Labs, 10Continuous-Integration-Infrastructure, 07Wikimedia-Incident: Investigate upgrade of OpenStack python module for labnodepool1001 - https://phabricator.wikimedia.org/T143013#2575798 (10hashar)
[14:44:45] RECOVERY - Puppet staleness on tools-exec-1202 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:45:46] 06Labs, 10Tool-Labs, 10crosswatch: Crosswatch sends out large amounts of error mails, crashing tools-mail - https://phabricator.wikimedia.org/T143476#2575800 (10valhallasw) This is still happening. @Sitic, can you please take a look at this ASAP?
[14:47:42] RECOVERY - Puppet staleness on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:47:56] RECOVERY - Puppet staleness on tools-exec-1206 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:48:34] RECOVERY - Puppet staleness on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:49:32] RECOVERY - Puppet staleness on tools-services-02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:52:13] 06Labs, 10Tool-Labs: Setup a simple service that pages when it is unreachable - https://phabricator.wikimedia.org/T143638#2575812 (10chasemp) stop gap thought, paws outage to page all @labs personnel?
[14:52:46] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[14:52:57] RECOVERY - Puppet staleness on tools-services-01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:53:43] !log wikilabels u_wikilabels=> update campaign set active = 't' where id = 8;
[14:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL, Master
[14:54:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[14:54:17] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[14:56:46] RECOVERY - Puppet staleness on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:57:32] 06Labs, 10Continuous-Integration-Infrastructure, 07Nodepool, 13Patch-For-Review: Clean up apt:pin of python modules used for Nodepool - https://phabricator.wikimedia.org/T137217#2575815 (10hashar) https://gerrit.wikimedia.org/r/306220 `nodepool: bump nova client and openstack CLI` should do it. That would...
[14:57:58] 06Labs, 10Continuous-Integration-Infrastructure, 07Nodepool, 13Patch-For-Review: Clean up apt:pin of python modules used for Nodepool - https://phabricator.wikimedia.org/T137217#2575822 (10hashar) p:05Low>03High
[14:58:18] RECOVERY - Puppet staleness on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:58:34] 06Labs, 10Continuous-Integration-Infrastructure, 07Nodepool, 13Patch-For-Review: Clean up apt:pin of python modules used for Nodepool - https://phabricator.wikimedia.org/T137217#2575825 (10hashar)
[14:58:54] RECOVERY - Puppet staleness on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:59:38] !log wikilabels update campaign set active = 'f' where id = 8;
[14:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL, Master
[14:59:48] !log wikilabels update campaign set active = 't' where id = 24;
[14:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL, Master
[15:05:15] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0]
[15:05:23] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [43200.0]
[15:08:44] chasemp: just FYI, concerning my quota increasement, my plan is to delete the old VM this weekend
[15:09:00] Luke081515: cool thanks, no hurry but I appreciate hte update
[15:09:15] :)
[15:10:14] chasemp: but my plan was right, the old machine has some issues, two weeks ago, we need 3 people to restore the ssh-login after a normal reboot
[15:10:48] so I'm happy to got the new one, so I don't need to ask people for helping me, at least at that point :)
[15:11:13] great to hear
[15:14:15] RECOVERY - Puppet run on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:17:13] 06Labs, 10Tool-Labs: Setup a simple service that pages when it is unreachable - https://phabricator.wikimedia.org/T143638#2575864 (10yuvipanda) How about we setup a simple endpoint in the proxy configuration itself and check that? That would catch the proxy specifically.
[15:25:26] 06Labs, 10Tool-Labs: Setup a simple service that pages when it is unreachable - https://phabricator.wikimedia.org/T143638#2575905 (10chasemp) >>! In T143638#2575864, @yuvipanda wrote: > How about we setup a simple endpoint in the proxy configuration itself and check that? That would catch the proxy specificall...
[15:31:26] 06Labs, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2575912 (10hashar) > we are doing a lot less and getting a lot more done it seems like Most of the jobs have been mov...
[15:32:47] RECOVERY - Puppet run on tools-worker-1010 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:34:13] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:59:44] 06Labs, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2575937 (10hashar) Gave it a try with: * max-server 12 * jessie 8 * trusty 4 And at 11 instances, attempt to spawn a...
[16:08:35] 06Labs, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2575961 (10hashar) Via the openstack command line tool I got 8 instances: | ef6ece68-5d79-4ba0-b2f0-e6d0fe308201 | ci-...
[16:13:52] 06Labs, 10Tool-Labs: Problem with utf-8 on Grid - https://phabricator.wikimedia.org/T143691#2575981 (10Vladis13)
[16:16:21] 06Labs, 10Tool-Labs: Python is different versions on Tool labs and Grit. How run a scripts on grid and with scheduling? - https://phabricator.wikimedia.org/T143473#2575997 (10Vladis13) Ok. But, it seems to me, there is a problem with the codepage utf-8 in python or console. T143691
[16:21:53] 06Labs, 10Tool-Labs: Problem with utf-8 on Grid - https://phabricator.wikimedia.org/T143691#2575981 (10valhallasw) You need to specify the i/o encoding if the output is piped to a file: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING
[16:44:43] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[16:46:04] 06Labs, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2576175 (10chasemp) @hashar I'm uncomfortable with you changing these values without a notice here before hand, a SAL e...
[16:46:06] 06Labs, 10Tool-Labs, 07Wikimedia-Incident: Tune nginx config parameters for tools / labs proxies - https://phabricator.wikimedia.org/T143637#2576176 (10greg)
[16:46:22] 06Labs, 10Labs-Infrastructure, 10Tool-Labs, 07Wikimedia-Incident: Write a simple script that handles failovering proxies - https://phabricator.wikimedia.org/T143639#2576181 (10greg)
[16:48:09] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[16:49:01] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:56:08] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[17:07:45] RECOVERY - Puppet staleness on tools-precise-dev is OK: OK: Less than 1.00% above the threshold [3600.0]
[17:09:15] RECOVERY - Puppet run on tools-precise-dev is OK: OK: Less than 1.00% above the threshold [0.0]
[17:10:03] PROBLEM - Host andrew-launch-test-101 is DOWN: CRITICAL - Host Unreachable (10.68.21.78)
[17:16:41] 10Tool-Labs-tools-Pageviews, 03Community-Tech-Sprint: Restrict Topviews to showing data only for individual days or months - https://phabricator.wikimedia.org/T142403#2576350 (10DannyH) p:05High>03Normal
[17:17:21] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[17:19:34] PROBLEM - Host tools-secgroup-test-103 is DOWN: PING CRITICAL - Packet loss = 100%
[17:24:40] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:30:54] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[17:34:29] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[17:42:14] 06Labs, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2576452 (10hashar) Asked sorry @chasemp :-(
[18:05:12] RECOVERY - Puppet staleness on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [3600.0]
[18:12:52] RECOVERY - Puppet run on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:59:27] 06Labs, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2576881 (10chasemp) >>! In T143016#2576452, @hashar wrote: > Asked sorry @chasemp :-( No worries, thanks for understan...
[19:06:03] 06Labs, 10Tool-Labs: Problem with utf-8 on Grid - https://phabricator.wikimedia.org/T143691#2576908 (10Vladis13) @valhallasw, thanks you. I wrote `export PYTHONIOENCODING=UTF-8` in .bash_profile, and run $`source .bash_profile`.
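The T143691 thread above (16:21 and 19:06) is about Python defaulting to a locale-dependent encoding when its output is piped, as it is for grid jobs writing to `.out` files. A minimal sketch of the fix, assuming a `python3` on `$PATH`: setting `PYTHONIOENCODING` pins the stream encoding regardless of locale, which can be verified from inside the child process.

```shell
# When Python's stdout is a pipe (here, command substitution), its encoding
# normally comes from the locale and may be ASCII on a bare grid node.
# PYTHONIOENCODING overrides that; the child reports the encoding it got.
ENC=$(PYTHONIOENCODING=UTF-8 python3 -c 'import sys; print(sys.stdout.encoding)')
echo "$ENC"
```

Exporting the variable from `.bash_profile`, as done in the ticket, only covers interactive shells; for grid jobs it is safer to set it in the job script itself or pass it via the submit command.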
[19:09:02] chasemp: sorry for the Nodepool instances spawning experiment earlier today :( I should have asked there and at least used !log it
[19:10:20] hashar: yep thanks, I made a note there as to a pretty interesting upstream commit / issue
[19:13:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:16:00] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[19:37:21] chasemp: back from scap / mw train. I noticed a couple instances that were not showing in the api calls or in some horizon view
[19:37:33] but would still show in horizon on the main page / overview of a project
[19:38:01] I confirmed that both instances got started when nodepool was spamming the api / having quota issues. So there might be some glitches / race condition that prevent it to flag the deletion somehow
[19:38:05] the uuid is unreacheable
[19:38:12] haven't tried to ssh to the allocated ip though :(
[19:53:51] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:56:01] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:09:52] 06Labs, 10Tool-Labs: Build a puppet failure check for tools that's less flaky than current one - https://phabricator.wikimedia.org/T143499#2569899 (10madhuvishy) +1. It would be awesome to schedule downtime for groups of hosts too - all k8s workers, grid exec nodes etc.
[20:16:18] !log wikilabels deployed wikilabels-wmflabs-deploy:3cc8e1c
[20:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL, Master
[21:04:19] 06Labs, 10Phlogiston (Interrupt): Create new Phlogiston instance for production - https://phabricator.wikimedia.org/T142277#2577447 (10JAufrecht)
[21:05:09] 06Labs: Revert: Request increased quota for Phlogiston labs project - https://phabricator.wikimedia.org/T143020#2577449 (10JAufrecht) I've created phlogiston-03, moved the DNS, and terminated phlogiston-01. Thanks.
[21:05:17] 06Labs: Revert: Request increased quota for Phlogiston labs project - https://phabricator.wikimedia.org/T143020#2577450 (10JAufrecht) a:03chasemp
[21:05:33] 06Labs, 10Phlogiston (Interrupt): Create new Phlogiston instance for production - https://phabricator.wikimedia.org/T142277#2529844 (10JAufrecht)
[21:50:44] 06Labs, 07Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2577565 (10chasemp)
[21:50:46] 06Labs: Revert: Request increased quota for Phlogiston labs project - https://phabricator.wikimedia.org/T143020#2577562 (10chasemp) 05stalled>03Resolved >>! In T143020#2577449, @JAufrecht wrote: > I've created phlogiston-03, moved the DNS, and terminated phlogiston-01. Thanks. let me know how it works out!
[21:50:48] 06Labs, 10Phlogiston (Interrupt): Create new Phlogiston instance for production - https://phabricator.wikimedia.org/T142277#2577564 (10chasemp)
[22:03:44] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Public IPs not being updated from OpenStack Nova plugin - https://phabricator.wikimedia.org/T52620#2577601 (10AlexMonk-WMF) >>! In T52620#2570792, @Andrew wrote: > In theory this issue is already resolved because the status plugin gets 'exists' notifications...
[22:15:02] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[22:35:11] (03PS1) 10Mattflaschen: Add #Augmented-Changes-Feed to #wikimedia-collaboration [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/306297
[22:50:01] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:06:08] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Striker, and 2 others: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2577831 (10bd808)
[23:42:24] PROBLEM - Free space - all mounts on tools-docker-builder-01 is CRITICAL: CRITICAL: tools.tools-docker-builder-01.diskspace.root.byte_percentfree (<33.33%)
[23:57:23] RECOVERY - Free space - all mounts on tools-docker-builder-01 is OK: OK: All targets OK
[23:59:05] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Joaquinito01 was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=818161 edit summary: