[00:40:56] Labs, Tool-Labs: Create sqldump script - https://phabricator.wikimedia.org/T151680#2824464 (yuvipanda) Could we just rename replica.my.cnf to .my.cnf instead and provide symlinks? :)
[00:49:59] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[03:39:57] Labs, Tool-Labs, Community-Tech-Tool-Labs, User-bd808: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336#2826204 (bd808)
[03:41:16] Labs, Tool-Labs, Community-Tech-Tool-Labs, User-bd808: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336#2689440 (bd808) Open>Resolved Results are at https://meta.wikimedia.org/wiki/Research:Annual_Tool_Labs_Survey/2016. The talk page at https://meta.wikimedia.org/wik...
[04:03:15] Labs, Tool-Labs: Freenode sometimes throttles bot connections from tools - https://phabricator.wikimedia.org/T151704#2825129 (Peachey88) One of the WMFGCs for IRC: ```lang=irc If someone tells me what IPs or range need a whitelist I can email the folks who deal with that at freenode. ```
[05:07:43] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[05:49:45] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[06:26:00] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[07:06:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:12:40] RECOVERY - Free space - all mounts on tools-docker-registry-01 is OK: OK: All targets OK
[07:13:59] Labs, Labs-Infrastructure, DBA, Datasets-General-or-Unknown, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2826343 (Marostegui) The transfer of s3 has started, from db1044 to db1095.
[09:00:09] Labs, Labs-Infrastructure, DBA, Datasets-General-or-Unknown: Rebuild old timestamp format tables - https://phabricator.wikimedia.org/T151607#2826415 (Marostegui) MariaDB replied with a solution that works, however they admit that it is weird that an `ALTER TABLE force` doesn't work: https://jira...
[09:02:27] Labs, Labs-Infrastructure, DBA, Datasets-General-or-Unknown: Rebuild old timestamp format tables - https://phabricator.wikimedia.org/T151607#2826417 (Marostegui) The above comment obviously only works on `10.1`
[09:50:22] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[10:25:21] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:31:07] Labs, DBA, Operations: fstrim: Operation not supported on Labs DBs - https://phabricator.wikimedia.org/T151746#2826574 (Volans)
[11:22:55] Labs, DBA: Prepare and check storage layer for the future private wiki arbcom-cs.wikipedia.org - https://phabricator.wikimedia.org/T151752#2826725 (MarcoAurelio)
[11:23:50] Labs, Labs-Infrastructure, DBA, Datasets-General-or-Unknown, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2826738 (Marostegui) - data transferred - ran mysql_upgrade on it all good - replication is...
[11:48:06] Labs, DBA: Prepare and check storage layer for new fi.wikivoyage.org - https://phabricator.wikimedia.org/T151756#2826837 (MarcoAurelio)
[11:49:00] Labs, DBA: Prepare and check storage layer for new fi.wikivoyage.org - https://phabricator.wikimedia.org/T151756#2826852 (MarcoAurelio) Task created as per [[ https://wikitech.wikimedia.org/wiki/Add_a_wiki#Start | instructions ]] on Wikitech.
[11:56:41] Labs, DBA: Prepare and check storage layer for the future private wiki arbcom-cs.wikipedia.org - https://phabricator.wikimedia.org/T151752#2826870 (MarcoAurelio) #Patch-for-review: https://gerrit.wikimedia.org/r/#/c/323814/2 - Added `arbcom_cswiki` to `$private_wikis` in `realm.pp`.
[12:04:43] Labs, DBA, User-Urbanecm: Prepare and check storage layer for the future private wiki arbcom-cs.wikipedia.org - https://phabricator.wikimedia.org/T151752#2826897 (Urbanecm)
[12:35:52] Labs, DBA, User-Urbanecm: Prepare and check storage layer for the future private wiki arbcom-cs.wikipedia.org - https://phabricator.wikimedia.org/T151752#2826966 (jcrespo) a:jcrespo
[13:15:20] Labs: Password for root reset on each Puppet run for Labs instances that do not have their own puppetmaster - https://phabricator.wikimedia.org/T151760#2827032 (scfc)
[13:33:02] Labs: Password for root reset on each Puppet run for Labs instances that do not have their own puppetmaster - https://phabricator.wikimedia.org/T151760#2827062 (scfc) p:Triage>High a:scfc
[14:02:52] Labs, Tool-Labs: Restore replica.my.cnf for toolsbeta.admin - https://phabricator.wikimedia.org/T109807#1559682 (chasemp) The short answer is: despite the presence of a "project" option there is no multi-tenancy thinking at all. I made some notes on this mechanism [[ https://phabricator.wikimedia.org/T1...
[14:03:50] Labs, Tool-Labs: error starting webservice - https://phabricator.wikimedia.org/T142932#2551666 (chasemp) If that is the case it seems this should affect jessie as readily as trusty?
[14:07:08] Labs, Tool-Labs: Create sqldump script - https://phabricator.wikimedia.org/T151680#2824464 (chasemp) This can get somewhat dangerous (volatile and impacting on other users) when done onto NFS depending on the size of the user DB. Maybe we could restrict this to output to `scratch` or something.
[14:12:37] Labs: Request creation of etytree labs project - https://phabricator.wikimedia.org/T151762#2827131 (Epantaleo)
[14:30:51] Labs, Tool-Labs: Restore replica.my.cnf for toolsbeta.admin - https://phabricator.wikimedia.org/T109807#2827177 (scfc) Perhaps I'm missing something here. I did not ask for `create-dbusers` to run continuously to create `replica.my.cnf` for any new service group in the Toolsbeta project. I asked: > Wh...
[14:32:57] Labs, Tool-Labs: Restore replica.my.cnf for toolsbeta.admin - https://phabricator.wikimedia.org/T109807#2827184 (scfc) Ah, okay, you //removed// `--once` in 7a9a70c541f2a1d44c23d2e7086b540b5224cce4. "* Interval defaults to 0 instead of a --once option" probably means then `--interval 0` or something lik...
[14:46:20] Labs, Tool-Labs: Restore replica.my.cnf for toolsbeta.admin - https://phabricator.wikimedia.org/T109807#2827221 (chasemp) >>! In T109807#2827177, @scfc wrote: > Perhaps I'm missing something here. I did not ask for `create-dbusers` to run continuously to create `replica.my.cnf` for any new service group...
[14:48:20] Labs, Tool-Labs: error starting webservice - https://phabricator.wikimedia.org/T142932#2827224 (scfc) There are no webgrid hosts running Jessie, so this issue can only occur on Trusty instances.
[14:57:20] Labs, Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2827253 (Andrew)
[14:57:22] Labs, Tool-Labs: Freenode sometimes throttles bot connections from tools - https://phabricator.wikimedia.org/T151704#2827254 (Andrew)
[14:59:07] Labs, DBA, User-Urbanecm: Prepare and check storage layer for the future private wiki arbcom-cs.wikipedia.org - https://phabricator.wikimedia.org/T151752#2827258 (MarcoAurelio) I guess we can remove #Labs here as it should not replicate in Labs? It's a private wiki after all.
[15:08:03] Labs, Tool-Labs: error starting webservice - https://phabricator.wikimedia.org/T142932#2827290 (chasemp) Sure, I was thinking of webservices running in k8s land. We had some issues a week ago that would fit the failure model that included containerized services.
[15:20:33] Labs, Community-Tech, DBA, MediaWiki-extensions-PageAssessments, and 2 others: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2827293 (chasemp) Open>Resolved This has been created. (let me know otherwise)
[15:29:53] Labs, Labs-Kubernetes, Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2827310 (MoritzMuehlenhoff) >>! In T140256#2825802, @hashar wrote: > I haven't seen that kernel soft lock occurring for a while. I guess it was a bug in the kernel that ran on labvirt hos...
[15:47:26] Labs, Tool-Labs: error starting webservice - https://phabricator.wikimedia.org/T142932#2827347 (scfc) Sorry, I don't know enough (yet) about Kubernetes to answer that.
[15:55:29] Labs, Patch-For-Review: Password for root reset on each Puppet run for Labs instances that do not have their own puppetmaster - https://phabricator.wikimedia.org/T151760#2827364 (scfc) Open>Resolved
[16:07:05] PROBLEM - Puppet run on tools-exec-1221 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[16:25:11] Labs: Request creation of etytree labs project - https://phabricator.wikimedia.org/T151762#2827131 (chasemp) +1, good luck. Let us know how this is working out.
[16:38:18] Labs, Labs-Infrastructure, DBA, Datasets-General-or-Unknown, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2827470 (jcrespo)
[16:38:21] Labs, Labs-Infrastructure, DBA, Datasets-General-or-Unknown, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2827467 (jcrespo) Open>Resolved a:Marostegui I would say the testing is done, let's...
[16:47:04] RECOVERY - Puppet run on tools-exec-1221 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:44:34] Labs: Request creation of etytree labs project - https://phabricator.wikimedia.org/T151762#2827761 (Andrew) Open>Resolved @Epantaleo, the project now exists, with you as a projectadmin. You can add new members or projectadmins as you see fit.
[17:44:36] Labs, Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#2827763 (Andrew)
[18:05:45] Labs, Tool-Labs: Freenode sometimes throttles bot connections from tools - https://phabricator.wikimedia.org/T151704#2827832 (Andrew) We're going to fix it both ways. I added floating IPs to the remaining 10 exec nodes. I've also emailed ilines@freenode.net to ask for a lift on the connection limit. I...
[18:12:13] Labs, Tool-Labs: Freenode sometimes throttles bot connections from tools - https://phabricator.wikimedia.org/T151704#2825129 (Krenair) With most of the IPs in the labs public /25 it's at least possible to determine which underlying instance is the source. I would expect the .255 NAT IP to be hard to get...
[18:54:39] Labs, Labs-Infrastructure, DBA, Datasets-General-or-Unknown, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2828133 (Marostegui) S3 and S1 are now replicating in db1095. There was some issues when rep...
[19:19:03] Labs, Labs-Infrastructure, DBA, Datasets-General-or-Unknown, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2828236 (jcrespo) Basically, the heartbeat table is shared between shards, so the replace co...
[19:31:01] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 6.99 ms
[19:37:42] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[19:40:58] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[19:44:07] Labs, Tool-Labs, WMF-Legal: Install unrar on Tool Labs - https://phabricator.wikimedia.org/T151794#2828377 (Dispenser)
[19:52:09] Labs, Tool-Labs, Patch-For-Review: error starting webservice - https://phabricator.wikimedia.org/T142932#2828424 (scfc) >>! In T142932#2562563, @valhallasw wrote: > […] > Although the //times// differ, the change is consistently during the first puppet run in the `/var/log/puppet.log.3.gz` file. Is l...
[20:01:39] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 2.07 ms
[20:03:46] Labs, Tool-Labs, WMF-Legal: Install unrar on Tool Labs - https://phabricator.wikimedia.org/T151794#2828377 (Platonides) It has been a long time since I last looked at that problem, but… isn't the free unrar enough for detecting that it is indeed a rar file and not a random file containing `Rar!` ?
[20:06:15] (PS3) Rush: www: guard against missing $xjob->JB_hard_resource_list->qstat_l_requests [labs/toollabs] - https://gerrit.wikimedia.org/r/322605 (owner: BryanDavis)
[20:08:18] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[20:13:02] (CR) Rush: [C: 2 V: 2] www: guard against missing $xjob->JB_hard_resource_list->qstat_l_requests [labs/toollabs] - https://gerrit.wikimedia.org/r/322605 (owner: BryanDavis)
[20:28:15] Urbanecm: Around?
[20:30:16] For a while yes.
[20:30:33] Amir1, ^
[20:30:56] Urbanecm: I'm not a member of templatetiger
[20:31:08] if it's the matter about checkwiki tool I might be able to help
[20:31:13] Amir1, but the files are owned by checkwiki.
[20:32:32] Amir1, and I see you chmoded them. Thanks very much for it!
[20:32:39] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:32:45] Yes, I just chmoded
[20:32:48] yw
[20:33:03] That was what I wanted the whole time :).
[20:34:50] I just realized the files are empty! Okay, let's start getting relevant info again... Maybe I should learn dump-parsing and it'll be faster...
[21:34:13] Labs, Tool-Labs, WMF-Legal: Install unrar on Tool Labs - https://phabricator.wikimedia.org/T151794#2828987 (Dispenser) The Free unrar can only decompress archives created by RAR versions prior to 2.9 (2002). All hidden archives I've found were using the new format.
[21:39:07] Labs, Tool-Labs: Grid jobs often stuck after Tool Labs maintenance - https://phabricator.wikimedia.org/T151603#2822434 (chasemp) Is there a way to identify a job as "ghost process" in a generic way?
[21:46:04] Labs, Labs-Infrastructure, Icinga, Monitoring, Operations: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#2829025 (Dzahn) Yes, it should be fine to re-enable these. The whole ticket was about not sending SMS, that is about the special "sms" conta...
[21:52:38] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[21:54:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:55:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:55:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[21:55:34] PROBLEM - Puppet run on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[21:55:48] PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[21:56:46] Labs, Labs-Infrastructure, DBA: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2829103 (chasemp) @jcrespo how do you feel about setting limits via http://dev.mysql.com/d...
[21:57:38] PROBLEM - Puppet run on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[21:59:17] PROBLEM - Puppet run on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:59:29] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:01:11] PROBLEM - Puppet run on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[22:08:52] Labs, Labs-Infrastructure, Icinga, Monitoring, Operations: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#2829166 (Dzahn) How to check directly on einsteinium in the actually genereated results which checks are paging via the "sms" contact group....
[22:12:38] RECOVERY - Puppet run on tools-webgrid-lighttpd-1415 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:14:18] Labs, Labs-Infrastructure, Icinga, Monitoring, Operations: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#2829180 (Dzahn) proof that the same service "nova_conductor" gets different contact_groups whether it's on a "test" host or not: ``` [eins...
[22:34:36] RECOVERY - Puppet run on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:35:08] RECOVERY - Puppet run on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:35:08] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:35:32] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:35:46] RECOVERY - Puppet run on tools-webgrid-lighttpd-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:36:16] RECOVERY - Puppet run on tools-webgrid-lighttpd-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:37:36] RECOVERY - Puppet run on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:39:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:39:28] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]