[00:41:51] (03CR) 10Paladox: [C: 032] Move around a load more logging, responses etc. [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/321338 (owner: 10Alex Monk)
[00:42:43] (03Merged) 10jenkins-bot: Move around a load more logging, responses etc. [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/321338 (owner: 10Alex Monk)
[00:58:31] 10Tool-Labs-tools-Pageviews: Options show/hide values on the chart - https://phabricator.wikimedia.org/T150625#2791459 (10MusikAnimal)
[01:02:28] 10Tool-Labs-tools-Pageviews: Create "Userviews" tool (probably can come up with a better name) - https://phabricator.wikimedia.org/T150585#2791471 (10MusikAnimal)
[04:53:08] !log tools.stashbot Restarted for wrap wiki entries in [[Template:SAL entry]] (T37876)
[04:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[04:53:12] T37876: Labslogbot should auto-link git hashes and Gerrit change-ids - https://phabricator.wikimedia.org/T37876
[05:38:23] !log tools.stashbot Testing | messages | with | embedded | pipes |
[05:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[05:48:02] 06Labs, 10Tool-Labs: Perl module problems on 14## exec nodes - https://phabricator.wikimedia.org/T150120#2791683 (10Beetstra) @Valhallasw: you say ".. on an older Perl version doesn't work on a newer version anymore" .. there is a newer Perl version on the 14XX hosts (and also a new PHP for that matter)?
[05:51:18] 06Labs, 10Tool-Labs: Perl module problems on 14## exec nodes - https://phabricator.wikimedia.org/T150120#2774533 (10scfc) For any productive debugging, a more detailed bug report than: "I didn't change anything, and now it crashes" is needed. For example, the Perl module is (and always was?) named `BSD::Resou...
[06:14:51] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[06:15:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[06:15:28] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[06:15:40] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[06:16:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[06:16:58] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[06:17:14] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[06:17:41] PROBLEM - Puppet run on tools-worker-1016 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[06:17:59] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[06:18:13] ummm
[06:18:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[06:18:21] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[06:18:29] PROBLEM - Puppet run on tools-worker-1017 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[06:18:43] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[06:19:21] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[06:20:15] PROBLEM - Puppet run on tools-worker-1009 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[06:20:53] i'm decently sure this is just me testing
[06:21:10] yup, recoveries should come about anytime soon
[06:34:49] RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:38:28] RECOVERY - Puppet run on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:38:42] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:40:40] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:41:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:42:14] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:42:40] RECOVERY - Puppet run on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:43:19] RECOVERY - Puppet run on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:44:21] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:45:09] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:45:17] RECOVERY - Puppet run on tools-worker-1009 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:45:32] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:47:00] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:47:58] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:48:18] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[07:50:12] PROBLEM - Puppet staleness on tools-services-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[08:13:31] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[08:23:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[08:25:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[08:48:33] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:00:08] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:03:09] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:35:40] 06Labs, 10Tool-Labs-tools-Other, 10DBA: High replication activity filled up labsdb1004 with binlogs - https://phabricator.wikimedia.org/T150553#2789420 (10akosiaris) I am the one that responded (partly) to the page by increasing the space on said filesystem. That was done via LVM resizing and XFS grow filesy...
[10:55:51] 06Labs, 10Tool-Labs: Perl module problems on 14## exec nodes - https://phabricator.wikimedia.org/T150120#2792041 (10Beetstra) I resolved the first three * the regex problem is a perl-problem, it has apparently been set to more strict (it is something all my bots complain about on those regexes, it is known for...
[11:00:10] 06Labs, 10Tool-Labs: Perl module problems on 14## exec nodes - https://phabricator.wikimedia.org/T150120#2792057 (10Beetstra) The only other things that now changed, is that I ran cpan to install LWP' - has that changed settings that now make everything run? Or did s.o. enforce a refresh on the modules server...
[11:06:41] 06Labs, 10Tool-Labs: Perl module problems on 14## exec nodes - https://phabricator.wikimedia.org/T150120#2792062 (10Beetstra) 05Open>03Resolved a:03Beetstra Now works.
[12:07:29] 06Labs, 10Tool-Labs: Perl module problems on 14## exec nodes - https://phabricator.wikimedia.org/T150120#2792202 (10scfc) Great that it works now. I just want to address one thing though: > […] those are both completely new and due to the install on the 14XX nodes being different from the 12XX .. […] This i...
[14:07:10] Hi, I had a database on host svwiki.labsdb named p50380g51020_perfectbot but it's been missing for some while, has it been deleted or moved?
[14:09:45] fluff: it wouldn't be normal to drop a DB (without some explicit notification especially) unless there was a dire issue affecting other users, I think only the DBA's would have a good answer for you and a task in phab is probably the best bet there.
[14:10:35] chasemp: alright, thanks
[14:10:47] fluff, let me check
[14:10:56] but chasemp is right we do not drop things at all
[14:12:29] p50380g51020_perfectbot is an incorrectly-named database (it should have 2 underscores), maybe
[14:12:47] it should be named like that, or permissions got mixed
[14:12:58] let me check if I can find a similar named one
[14:13:24] fluff, to check you own that account, could you create at ticket on phabricator?
[14:13:52] this smells like a very old db name, when tools was brand new (as far i can remember).
[14:14:14] yes, I will try to find it under similar names
[14:14:33] Yeah, it has been there for a while
[14:14:41] jynus: I'll create a ticket
[14:15:43] fluff, sorry for the inconveniences, and thank you
[14:17:12] 06Labs, 10Labs-Infrastructure, 10DBA, 10MediaWiki-extensions-ORES, and 3 others: Replicate ores_classification and ores_model tables in labs - https://phabricator.wikimedia.org/T148561#2792408 (10chasemp) >>! In T148561#2791429, @Halfak wrote: >> on existing wiki's starting with these as X happens to add t...
[14:29:55] fluff, I've done some investigation, I can respond on the ticket if you have already created one
[14:32:53] 06Labs: Rename old database - https://phabricator.wikimedia.org/T150659#2792464 (10Fluff)
[14:33:06] There we are :)
[14:37:47] jynus: I'll be offline for a bit, but thanks in advance!
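The discussion above turns on tool database naming: `p50380g51020_perfectbot` is flagged as malformed because the credential prefix and the database name should be separated by two underscores, not one. A minimal sketch of that check, assuming the `<credential-user>__<dbname>` convention implied by the conversation (the exact prefix format is an assumption for illustration):

```python
import re

# Assumed convention from the discussion: a credential user like
# "p50380g51020" followed by a DOUBLE underscore, then the db name.
DB_NAME_RE = re.compile(r"^(?P<prefix>[ps]\d+g\d+)__(?P<name>\w+)$")

def is_well_formed(db_name: str) -> bool:
    """Return True if db_name follows the prefix__name convention."""
    return DB_NAME_RE.match(db_name) is not None
```

Under this pattern the name from the log fails the check (one underscore), while the two-underscore form passes.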
[14:40:21] 06Labs, 10DBA: Rename old database - https://phabricator.wikimedia.org/T150659#2792479 (10jcrespo) a:03jcrespo
[14:58:34] last change I swear
[15:31:51] 06Labs: Request creation of hound labs project - https://phabricator.wikimedia.org/T148573#2792551 (10Andrew) Update: We haven't forgotten this, still waiting for Legal.
[15:34:16] 06Labs: Request increased quota (floating ip) for "cvn" labs project - https://phabricator.wikimedia.org/T150209#2777950 (10chasemp) +1
[15:37:09] 06Labs: Request increased quota (floating ip) for "cvn" labs project - https://phabricator.wikimedia.org/T150209#2777950 (10Andrew) Done! I'm leaving this ticket open -- please let me know when you're done with the IP and I'll revert the quota and close this.
[15:41:07] 06Labs, 10DBA: Add visitingwatchers to watchlist_count - https://phabricator.wikimedia.org/T150547#2789248 (10jcrespo) Dispenser, thanks for the separate task. If you are ok with that, I will put T59617 as a blocker of this, not a subtask. The request is reasonable, but this being a very dynamic field it won'...
[15:55:19] 06Labs, 10DBA: Reduce watchlist_count threshold - https://phabricator.wikimedia.org/T150548#2789278 (10jcrespo) Dispenser, this is a more complicated request, because it is not about exposing more data in tools, it is about exposing more data in general. Looking at the api, currently the threshold is at "Fe...
[15:56:57] 06Labs, 10DBA, 10Wikimedia-Site-requests: Reduce watchlist_count threshold - https://phabricator.wikimedia.org/T150548#2792575 (10jcrespo)
[16:00:34] 06Labs, 10Labs-Infrastructure, 10DNS, 06Operations, and 2 others: Set SPF (... -all) for toolserver.org - https://phabricator.wikimedia.org/T131930#2792584 (10Dzahn)
[16:00:54] 06Labs, 10Labs-Infrastructure, 10DNS, 10Mail, and 3 others: Set SPF (... -all) for toolserver.org - https://phabricator.wikimedia.org/T131930#2792585 (10Dzahn)
[16:14:27] !log tools Disabling puppet across tools T146154
[16:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:14:30] T146154: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154
[16:18:23] !log Stopped irc-echo and puppet on shinken-01 for T146154
[16:18:24] Unknown project "Stopped"
[16:18:33] !log shinken Stopped irc-echo and puppet on shinken-01 for T146154
[16:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Shinken/SAL
[16:21:08] !log tools restarting all webservice jobs, watching webservicewatcher logs on tools-services-02
[16:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:21:46] !log enable puppet and run on tools-services-01
[16:21:47] Unknown project "enable"
[16:22:06] !log tools enable puppet and run on tools-services-01
[16:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:22:28] !log tools kill maintain-kubeusers on tools-k8s-master-01, sole process touching NFS
[16:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:30:58] !log tools Unmounted all nfs shares from tools-k8s-master-01 (sudo /usr/local/sbin/nfs-mount-manager clean) T146154
[16:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:31:03] T146154: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154
[16:35:49] 06Labs, 10DBA: Add visitingwatchers to watchlist_count - https://phabricator.wikimedia.org/T150547#2792687 (10Dispenser) I'm fine with stale data. With the API, visitingwatchers represents 0.005 per watcher checked, in total ~3 - 30 second. Is there a way to get a last modified time like with information_sch...
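The `!log` exchanges above show the SAL bot's parsing rule in action: the first word after `!log` is taken as the project name, which is why "!log Stopped irc-echo ..." gets the reply `Unknown project "Stopped"`. A sketch of that parsing logic, assuming a small hard-coded project set for illustration (the real bot's project list and internals are not shown in the log):

```python
# Assumed subset of valid projects, just for this example.
KNOWN_PROJECTS = {"tools", "shinken", "tools.stashbot"}

def parse_log_command(line: str):
    """Parse '!log <project> <message>' the way the bot above appears to:
    the first token after !log is always treated as the project name."""
    if not line.startswith("!log "):
        return None
    rest = line[len("!log "):].strip()
    project, _, message = rest.partition(" ")
    if project not in KNOWN_PROJECTS:
        # Mirrors the bot's reply when the project token is missing/wrong.
        return ("error", f'Unknown project "{project}"')
    return ("ok", project, message)
```

This also explains the fix seen in the log: re-sending the line with the project name ("!log shinken Stopped irc-echo ...") succeeds.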
[16:37:07] it will be interesting to see if stashbot survives the read-only time. I think it should but that hasn't been tested
[16:44:18] 06Labs, 10DBA: Add visitingwatchers to watchlist_count - https://phabricator.wikimedia.org/T150547#2792743 (10jcrespo) > Is there a way to get a last modified time like with information_schema? Not now, but we could add such information somewhere.
[16:47:36] !log tools start restarting kubernetes webservice pods
[16:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:57:08] !log tools stopped gridengine master
[16:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:32:44] yuvipanda: Thanks for the broadcast :-)
[17:32:44] multichill: :D yw
[17:32:46] yuvipanda (CC zareen): i'm getting "504 Gateway Time-out" repeatedly right now at http://paws-public.wmflabs.org/paws-public/ ...
[17:32:47] HaeB: yeah, ongoing labs maintenance. everything's going to be misbehaving for a few hours at least
[17:32:47] ah ok, thanks
[17:32:49] yuvipanda: so toollabs just stopped responding on me. Is the NFS maintanence happening now? I can't connect to the DB.
[17:32:50] yes it is happening now
[17:32:50] But why is it affecting the DB connection?
[17:32:50] we're in the process of making the NFS system readonly, so all filesystem access to NFS is going to be screwed for a while
[17:32:51] Oh.
[17:32:52] Well shit. I think I just sent a screwy command to the DB, and I wanted to abort it in case I did.
[17:32:52] Sweet I can still access it via my own project. :D
[17:32:53] Tool Labs is returning 502s. NFS related?
[17:32:54] Matthew_: yes. our readonly switch didn't go as well as expected, so all tools are down atm
[17:32:55] Okay.
[17:32:55] will come back readonly shortly hopefully
[17:32:55] Okay, thank you.
[17:32:56] !log tools Kicked off rsync of tools from labstore1001 to 1005 (T146154)
[17:32:56] madhuvishy: Slashbot quit about 10 minutes ago.
[17:32:56] Matthew_: yes i realized that after logging, thanks!
[17:32:57] let's still !log anyway
[17:32:57] No problem.
[17:50:42] !log tools reboot all tools-k8s-* worker nodes
[18:02:53] 06Labs, 10Labs-Infrastructure, 10DBA, 07Availability: Decide between proxysql and haproxy for labsdbproxy service - https://phabricator.wikimedia.org/T149844#2792952 (10jcrespo) Note @Marostegui commented in favor of proxysql before I edited the tasks based on my observations of recent bugs, I would like t...
[18:04:46] 06Labs, 10Labs-Infrastructure, 10DBA, 07Availability: Decide between proxysql and haproxy for labsdbproxy service - https://phabricator.wikimedia.org/T149844#2792959 (10Marostegui) As we spoke on hangouts a week ago and after seeing the bug list you posted, I am fine with not going with it to production. M...
[18:08:23] I've just recived this error:
[18:08:26] error: commlib error: got select error (Connection refused)
[18:08:26] error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
[18:10:12] freddy2001: are you trying to submit a job?
[18:11:02] i have a job which ist running daily
[18:11:24] freddy2001: right, no jobs are getting scheduled now due to the maintenance
[18:11:31] https://lists.wikimedia.org/pipermail/labs-l/2016-October/004698.html
[18:16:55] oh okay... thank you madhuvishy. i haven't kept this on mind
[18:19:47] freddy2001: np! I'm updating the lists on this thread with ongoing progress https://lists.wikimedia.org/pipermail/labs-l/2016-November/004740.html
[18:20:07] 06Labs, 10Labs-Infrastructure, 10DBA, 13Patch-For-Review: Implement a frontend failover solution for labsdb replicas - https://phabricator.wikimedia.org/T141097#2793017 (10jcrespo)
[18:20:09] 06Labs, 10Labs-Infrastructure, 10DBA, 07Availability: Decide between proxysql and haproxy for labsdbproxy service - https://phabricator.wikimedia.org/T149844#2793014 (10jcrespo) 05Open>03Resolved a:03jcrespo > But as I expressed, I would like to still deploy it somewhere in our infra (as we discussed...
[18:24:51] !log tools Tools NFS is read-only. /data/project and /home across tools are ro T146154
[18:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:24:55] T146154: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154
[18:26:20] 06Labs, 06Operations, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2793044 (10madhuvishy)
[18:45:56] hello, is the NFS maintenance more or less done? I see the most tools are back up, but my Ruby app is still down
[18:46:18] it does not run on lighttpd and would probably require a manual restart, but `qstat` is returning errors
[18:46:46] musikanimal: you can't submit new jobs
[18:47:11] for right now, you mean?
[18:47:12] it isn't done - will probably be end of day (worst case tomorrow)
[18:47:20] oh dear, okay
[18:47:32] thanks
[18:47:45] https://lists.wikimedia.org/pipermail/labs-l/2016-November/004768.html
[18:47:48] np!
[18:51:37] madhuvishy: any idea when I'll be able to write to the filesystem?
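The `qstat` errors above ("Connection refused ... qmaster using port 6444") simply mean the gridengine master is down, so every cron job that tries to submit work fails and mails an error. A hedged pre-flight check a job wrapper could run, assuming the qmaster host and port from the error message (this is a sketch, not part of any real tool shown here):

```python
import socket

def qmaster_reachable(host: str, port: int = 6444, timeout: float = 3.0) -> bool:
    """Cheap pre-flight check: can we open a TCP connection to the
    gridengine qmaster? A refused connection (as in the error above)
    means job submission would fail, so a cron wrapper could skip
    this run quietly instead of mailing a failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Usage would be something like `if not qmaster_reachable("tools-grid-master.tools.eqiad.wmflabs"): sys.exit(0)` at the top of the cron wrapper.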
[18:52:34] I was hoping to submit a hotfix to make the Pageviews tool timeout when querying the Ruby app's API, since it's down entirely
[18:52:44] everything else is working, so just need to make it bypass that part
[18:53:20] this is the nightmare I was worried about when I set it up like this... making one tool dependent on another tool. My fault really, it should be self-contained
[18:59:26] musikanimal: several hours at least, I think
[19:01:11] musikanimal: yeah not before 5/6 pm PST - this is my best guess
[19:01:42] alrighty
[19:02:33] is there a way to be informed about scheduled maintenance?
[19:02:47] musikanimal: labs-l and labs-announce
[19:03:12] cool, thanks :)
[19:03:25] it was announced >4 weeks ago, sorry it didn't reach you!
[19:03:59] it happens
[19:07:21] hey, has anyone else reported the tools.wmflabs.org filesystem being read-only?
[19:07:52] at least for me?
[19:08:48] maybe i'm doing things in a backwards/outdated way, but I ssh to login.tools.wmflabs.org then become $project
[19:09:10] and i used to be able to work in that space, but now i'm getting an error that the filesystem is read only?
[19:15:11] edsu: yes, see the topic here and T146154
[19:15:11] T146154: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154
[19:16:19] the tl;dr is that this is a planned maintenance of the NFS cluster that requires read-only downtime
[19:17:25] announcements here https://lists.wikimedia.org/pipermail/labs-l/2016-October/004698.html and https://lists.wikimedia.org/pipermail/labs-l/2016-November/004768.html
[19:18:19] Can this be announced on more public lists the next time?
[19:18:36] There are a lot of Wikidata tools for example, I didn't knew about this.
[19:18:47] sjoerddebruin: do you have suggestions for what lists to broadcast to?
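The hotfix discussed above (make one tool degrade gracefully when the other tool's API is down) boils down to querying the dependency with a short timeout and a None fallback. A minimal stdlib sketch; the URL is a placeholder, not the real Pageviews or Ruby-app endpoint:

```python
import json
import urllib.error
import urllib.request

def fetch_optional(url: str, timeout: float = 2.0):
    """Query an optional dependency with a short timeout and degrade
    gracefully: return None instead of hanging or crashing when the
    other tool is down (connection refused, 5xx, timeout, bad JSON)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None
```

The caller then treats a None result as "feature unavailable" and renders the rest of the page, which is exactly the bypass described in the log.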
[19:19:22] labs-announce is our main place to send these, but we could see about getting that list to fan out to more places
[19:20:16] bd808: oops, sorry i will close https://phabricator.wikimedia.org/T150681 now then
[19:20:29] i am getting flooded with grid mails
[19:20:30] Well, the Wikidata one for example. I'll subscribe now though to that one.
[19:21:02] But this should be in Tech News too imo
[19:21:17] Everyone that runs a tool should really be subscribed to labs-announce. We could do a better job of promoting that at account creation
[19:21:25] yeah
[19:21:28] I don't run tools, I use them
[19:21:41] sjoerddebruin: agreed. it would have been nice to add to tech news
[19:21:58] bd808: i think i was on there at one point, but fell off -- i'll get back on
[19:22:09] bd808: i am getting flooded with tons of mails from Cron Deamon
[19:22:26] Steinsplitter: probably because your jobs are failing
[19:22:45] Steinsplitter: :( what are they telling you? jsut that the jobs are broke because of read-only stuff?
[19:23:03] tools.sbot@tools-bastion-02:~$ qstat
[19:23:04] error: commlib error: got select error (Connection refused)
[19:23:04] error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
[19:23:06] think so^^
[19:23:20] Steinsplitter: yeah it's unable to submit a new job
[19:24:20] if you want you kill your cron for now, and then you can restart it post the maintenance?
[19:24:31] wonder if we can stop these emails being sent for now
[19:25:58] turn off crond I guess?
[19:26:29] as long as the grid master is offline/stopped there really isn't anything that the grid crons can do
[19:26:38] yeah
[19:28:46] bd808: (low-prio) I'm aware of the tools NFS maintainence, but as a non-tools-user I'd more-or-less discarded those notifications as not affecting me. However, I don't seem to be able to access NFS shares, despite being a different project. Is this an unintended side effect of the maintenance on tools, or was the announcement not broad enough?
[19:29:17] stwalkerster: interesting, that shouldn't be the case - which project are you on?
[19:29:51] account-creation-assistance
[19:29:51] running cd /data/project hangs my shell
[19:29:52] stwalkerster: thanks, let me fix that
[19:35:50] !log tools Stopped cron on tools-cron-01 (T146154)
[19:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:35:55] T146154: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154
[19:49:23] stwalkerster: were you trying accounts-db3?
[19:49:30] yep
[19:49:47] stwalkerster: cool, should be fixed now
[19:50:18] confirmed, tyvm :)
[19:50:20] madhuvishy: can you make two notes in the etherpad? one for stopping cron and re-enabling on post rsync and another for getting on tech news radar for similar maint
[19:50:36] i made note for cron
[19:50:43] tech news will add
[19:51:09] ah cool way ahead of me :)
[19:53:53] when is maintenance going to be over? I thought it was "FINISHED" judging from the mailing list, but webservice still isn't starting and my home directory is still r/o
[19:55:16] bd808: I won't be merging anything until tomorrow, but would appreciate a look at the changes I made to https://gerrit.wikimedia.org/r/#/c/321169/ when you have a moment.
[19:56:03] bd808: also, are you blocked for want of striker reviews? A lot of my +1s have been lost due to patchset updates but I can go through and re-read things if that's helpful.
[19:56:20] Leloiandudu: it will be sometime as we wait for data migration, we'll blast a msg to labs-l on all clear
[19:57:10] chasemp: is there a rought time estimate? an hour? 4 hours? 10 hours?
[19:57:52] Leloiandudu: from above "yeah not before 5/6 pm PST - this is my best guess"
[19:58:37] chasemp: thanks
[19:59:16] Oh, well I will do something else than...
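Several users above hit the read-only NFS mount ("the filesystem is read only", home directories r/o). A tool that wants to fail fast rather than spew cron mails could probe writability by actually attempting a write, since permission bits can look fine on a read-only NFS mount. A sketch under that assumption:

```python
import errno
import os
import tempfile

def is_writable(directory: str) -> bool:
    """Detect a read-only (e.g. ro-mounted NFS) directory by actually
    creating and removing a temp file in it; checking permission bits
    alone can be misleading on NFS."""
    try:
        fd, path = tempfile.mkstemp(dir=directory)
    except OSError as e:
        if e.errno in (errno.EROFS, errno.EACCES):
            return False
        raise
    os.close(fd)
    os.unlink(path)
    return True
```

A job wrapper could call this on its data directory at startup and exit quietly when it returns False, instead of failing mid-run.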
[20:00:11] madhuvishy: Minor remark, only the US uses the weird MM/DD notation, very confusing for the rest of us. It's better to avoid using that notation ever ;-)
[20:00:22] this is probably the biggest planned maint and most desperately needed maint in recent labs memory :)
[20:00:42] andrewbogott: I'll check out the wikistatus patch. I am kind of blocked on Striker, but not completely. I put it down to focus on other things and haven't got back to fidning people to beg for +2 or at least fresh +1's yet.
[20:00:46] multichill: yeah point taken, traditionally we announce in UTC as well
[20:00:48] multichill: agreed
[20:01:08] bd808: ok, lmk if there are particular patches that need attention sooner than others
[20:01:19] i think I put in UTC time - but missed the date notation - saying 14 november would have been better
[20:01:40] Yeah, just saying the full date seems to be the least confusing
[20:01:51] +1
[20:02:16] 2016-11-14! it's an ISO standard and xkcd approved -- https://xkcd.com/1179/
[20:02:56] bd808: Bonus that it makes sorting really easy
[20:03:47] Anyway, hope your NFS major migration goes well!
[20:04:00] bd808: thank you :)
[20:06:21] hey valhallasw`cloud
[20:06:26] * valhallasw`cloud waves
[20:06:40] 'o/
[20:07:02] Forgot to ask last week, are you coming this Saturday valhallasw`cloud?
[20:07:42] What
[20:07:46] what's on saturday?
[20:07:47] wcnl?
[20:08:46] yup
[20:12:38] * valhallasw`cloud will check whether there are any other plans this weekend
[20:13:22] something something netherlands? L)
[20:13:25] :) even
[20:13:58] wikimedia conference NL :-)
[20:17:40] Maybe next time silence the cron deamon? Already got 30+ failure emails.....
[20:17:44] multichill: if you need an unconference topic for wcnl -- https://wikitech.wikimedia.org/wiki/User:BryanDavis/Developing_community_norms_for_critical_bots_and_tools
[20:18:38] No unconference I'm afraid. That seems to be a recurring topic. We already had that discussion years ago with the Toolserver
[20:19:52] bd808: I document because otherwise I end up writing the same bot twice ;-)
[20:20:08] multichill: can I assume the discussion went mostly nowhere?
[20:20:52] Everyone agrees, but a lot of people end up not following it.
[20:21:04] heh. that's the best. "I need a bot to do X, better look on wikitech. Oh apparently I wrote it 2 years ago."
[20:21:05] Because it's difficult or they're lazy, they forgot, etc etc
[20:21:37] multichill: *nod* this is why we need some kind of consistent peer pressure
[20:21:50] but I don't have the magic answer for how to do that
[20:22:34] We could probably have a set of recurring topics on different fields we could highlight every once in a while
[20:25:17] friendly public shaming with a healthy dose of love and appreciation seems to work ok in other venues
[20:30:19] At some point I want to have a "grade" in Striker for tools based on the completeness of their toolinfo data.
[20:30:55] its dorky but may help some folks who have to get 5 stars or an A+ or whatever.
[20:31:11] also it will give a way for a user to have some idea of what's up with a tool
[20:32:05] The thing that I always struggle with is how to communicate that information to the user
[20:32:19] how does a random person on a wiki know how and where to check for this?
[20:32:27] *nod*
[20:32:30] yeah
[20:33:02] I'm hoping bringing to light really good tools and appreciating them will disseminate the right message
[20:33:29] almost always people are looking for a quick way to do $things and that involves looking at other tools and tool maintainers
[20:33:46] who do we even point them to as best example atm?
[20:36:17] so... yeah. I need to kick off this RfC vote that would create a tools working group or whatever if we adopt it. I think that if we can draw in 3-5 people who what to make a difference and give them a tiny bit of power and a lot of agency then we may get somewhere.
[20:36:57] bd808: You're still waiting in my inbox....
[20:37:13] heh. nobody responded to that peal for help email
[20:37:20] *plea
[20:37:33] I will!
[20:37:54] Just not tonight
[20:38:06] You want to have https://www.ssllabs.com/ssltest/analyze.html?d=tools.wmflabs.org for tools ;-)
[20:38:25] multichill: yeah, something like that actually
[20:41:22] I just read that some people have https://tools-static.wmflabs.org/meta/scripts/pathoschild.ajaxtransclusiontable.js and other scripts in their user scripts on wiki making that very very slow
[20:42:01] Wonder why they have that in toollabs instead of just on meta
[20:43:07] wut, it's in https://meta.wikimedia.org/wiki/MediaWiki:Common.js
[20:43:17] mw.loader.load('//tools-static.wmflabs.org/meta/scripts/pathoschild.ajaxtransclusiontable.js');
[20:45:57] * valhallasw`cloud facepalms
[20:46:01] bd808: just "pick a name for your tool" can be troublesome
[20:46:49] multichill: that javascript should be on meta, although in this case it might be ok-ish because it's a static webserver
[20:46:53] still serving from nfs, though
[20:46:55] Platonides: yeah! I want to fix that up a bit in the next batch of work I do for Striker
[20:47:01] and it's still on labs...
[20:47:02] valhallasw`cloud / bd808 / ... Might want to comment at https://meta.wikimedia.org/wiki/MediaWiki_talk:Common.js#Toollabs_hosted_.js_in_Common.js.3F
[20:47:23] how do you expect to fix it?
[20:47:25] Platonides: bd808 do you mean because 'names are hard' or because it's hard to know what names are avail and what the existing nomenclature is?
[20:47:37] chasemp: the first one
[20:48:00] and sometimes even the "project" doesn't have a clear scope
[20:48:14] Platonides: ah the names are hard part I probably can't fix. Although I could make a qute name generator
[20:48:26] *cute
[20:48:38] a button that says 'drop cat on keyboard'
[20:48:40] "random tool that will probably not be needed for long"
[20:49:02] I was thinking more about the bad error message when you pick a name that is already used problem
[20:49:16] valhallasw`cloud: Didn't we have some sort of audit running at some point to check common.js common.css and related files for unwanted inclusions?
[20:49:28] right, that plus a quick primer on names would be ok
[20:49:48] but honestly no one has solved the 'names are hard' issue in all of technology
[20:49:57] see: swift the object store and swift the programming language
[20:49:57] multichill: we were planning to do the inverse, tool labs tools loading resources from e.g. google
[20:50:10] 2 things are hard: naming, cache invalidation, and off-by-one errors
[20:50:10] which is fine from a load perspective, but not from a privacy one :-)
[20:50:21] Like that. We found google analytics on one of the smaller wiki's
[20:50:30] bd808: xD
[20:50:31] o_0
[20:50:37] really multichill? oh my
[20:50:41] GA was on a wiki?
[20:50:46] I remember that but from a long time ago
[20:50:48] Couple of years ago I think
[20:50:56] * bd808 stabs
[20:51:09] added by an admin
[20:51:26] And I'm fairly sure someone setup an automated check every once in a while
[20:51:28] we need a code review system of on-wiki js
[20:51:44] hashar probably has a check for it :P
[20:51:47] that is one of the few topics I will straight run away from bd808
[20:51:48] :)
[20:51:52] Or at least knows where it puts it's output
[20:52:11] bd808: chasemp don't worry, code review system of on-wiki JS was discussed to death (literally) a while ago
[20:52:22] oh yes, I've been in a few
[20:52:34] yuvipanda: oh I know. mw-core was asked to build it every quarter
[20:52:34] they always end w/ this
[20:52:48] https://phabricator.wikimedia.org/T71445
[20:52:57] one of those things where I see the size of the scrollbar and nope out
[20:53:12] "Reinventing a code review system inside of a wiki is nuts, but making people learn a code review system can't possibly work"
[20:53:19] repeat
[20:53:46] chasemp: we actually have an in-wiki code review system that just needs a bit of dusting off
[20:53:53] *had
[20:54:01] codurr!
[20:54:06] it's just archived right?
[20:54:12] I've seen it, and seen it proposed and have no objections at all to anyone doing whatever works
[20:54:26] there are also article review tools
[20:54:37] I'll bet a few hundred dollars nothing will ever actually get implemented there tho
[20:54:44] ^
[20:54:53] and the burden of doing it is way more than it seems
[20:54:55] but the real problem is that this is not a "sexy" project that will ever get a PM to care
[20:55:00] I imagine there is a reason it's archived :)
[20:55:25] well, that or the real problem is people should be stuck learning the tools of the trade and that's untenable in this case
[20:55:26] bd808: can we remove text from inside parens in the lead section of articles to improve readability instead?
[20:56:06] yuvipanda: only if we can also make machine recommendations of what other articles to read
[20:57:31] I saw terminator genisys, so I know judgement day can only be delayed and not avoided
[20:57:58] PAWS seems to be down ...
[20:58:43] tobias47n9e: yep, ongoing labs maintenance. a few hours at least
[20:58:46] see /topic
[20:59:19] yuvipanda: Ooooooo :)
[21:00:25] What/where is "/topic"?
[21:00:39] Status: ~NOTICE~ Maintenance on entire NFS system today. ?
[21:00:47] fnielsen_: if you're on IRC, your client should display that somewhere
[21:01:04] if not you can write /topic and press enter and that should show you
[21:02:17] Hahahaha
[21:02:18] doh
[21:02:27] oh boy
[21:02:28] Oh no
[21:02:31] yuvipanda: You told him to do that!
[21:02:45] what happened
[21:02:52] Sorry
[21:02:57] fnielsen_: no worries
[21:02:59] yuvipanda: on some clients, /topic without parameter clears the topic :D
[21:03:04] aaaaah
[21:03:06] ouch
[21:03:08] > Wikimedia Labs (wikitech.wikimedia.org) stuff | Status: ~NOTICE~ Maintenance on entire NFS system today. Tools will be read-only for a period of time. See labs-l or T146154 for updates. | jsub now defaults to trusty | Channel is logged: https://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/
[21:03:09] T146154: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154
[21:03:10] (maybe all?)
[21:03:14] it hasn't propagated to matrixland yet
[21:03:22] and I remember /topic used to show me topic :(
[21:03:48] Could also mode +t to prevent the topic from changing...
[21:03:57] nah, we had it and it sucked
[21:03:58] Matthew_: no thanks
[21:04:10] And yes, /topic is client dependent.
[21:04:14] Mine shows me the topic.
[21:04:20] bd808: Just a thought xD
[21:04:54] Thanks for the info. I am not that often on IRC. I can rarely remember the channel names. I went over to wikimedia-operations at first. :)
[21:05:24] no worries, fnielsen_ :)
[21:05:29] fnielsen_: no harm done :)
[21:08:42] I wonder, there has to be a universal show topic command but /topic does it for me too
[21:12:59] Any word on when tools will be out of read-only mode? (I'm assuming I should wait until then to import some data updates for my application.)
[21:13:42] 5/6pm PST is my best guess - best case
[21:14:05] 01:00-02:00Z
[21:16:21] Thanks! I'll probably wait till tomorrow to do my data update, then.
[22:09:31] I went to a Kubernetes and "cloud native" conference last week. I decided that the best project I learned about there is because they have the cutest logo -- https://camo.githubusercontent.com/042a51c75ce23977378611ea6e3f97d75f9256dc/687474703a2f2f692e696d6775722e636f6d2f6e706b7a70386c2e706e67
[22:10:19] Yup, just go along with that.
[22:13:22] my bias for all things unicorn is probably in play here
[22:14:03] interesting
[22:14:28] the idea is kind of neat. Build a custom OS that only supports exactly the app you are wanting to run
[22:21:21] we've come full circle
[22:22:12] yeah. it happens over and over again. Their target audience is apparently IoT developers with tiny cpus and ram.
[22:22:44] of course a raspberry pi is like 1000 times more powerful than my first computer, but it's all relative.
[22:23:00] what hardware gives, software takes away
[22:23:29] bd808 but aren't unicorns nothing else than forbidden dogs ?
[22:24:35] https://twitter.com/_youhadonejob1/status/797898825239175168 ;-p
[22:25:01] :)
[22:29:24] Read-only should not cause well-known files (like replica.my.cnf) to go missing, right?
[22:29:34] https://meta.wikimedia.org/wiki/User_talk:Krinkle#GUC_is_broken_today
[22:29:40] https://tools.wmflabs.org/guc/?user=KrinkleBot
[22:29:55] Krinkle: yup, shouldn't
[22:29:56] Looks like whichever node it is running on, is missing this file
[22:30:05] it's there when I look on tools-login
[22:30:33] Krinkle: it's trying to read `/replica.my.cnf`
[22:30:33] note the /
[22:30:38] not from your homedir but from root
[22:30:40] for some reason
[22:30:57] yuvipanda: https://github.com/wikimedia/labs-tools-guc/blob/9d5f29ee9f9c47d3e81e2fe1aaec3ba2e18cc3d4/settings.php#L39
[22:31:03] Did that get changed?
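[Editor's note: the GUC settings code itself isn't quoted in the log, but the symptom — reading `/replica.my.cnf` instead of the file in the tool's home directory — is consistent with a home-directory lookup coming back empty and being concatenated onto the filename. A minimal Python sketch of that failure mode (a hypothetical illustration, not the tool's actual PHP code):]

```python
import os


def replica_cnf_path(home):
    """Build the path to replica.my.cnf the way a naive settings file
    might: by string concatenation. This is a hypothetical sketch of
    the suspected bug, not GUC's real code."""
    return home + '/replica.my.cnf'


def guess_home():
    """Best-effort home directory from the environment. If $HOME is
    unset or empty (e.g. the container's user could not be resolved),
    this returns '' and the path above collapses to the filesystem root."""
    return os.environ.get('HOME', '')


# Normal case: the file lives under the tool's home directory.
print(replica_cnf_path('/data/project/guc'))  # → /data/project/guc/replica.my.cnf

# Failure case: an empty home collapses the path to the root.
print(replica_cnf_path(''))  # → /replica.my.cnf
```

That collapsed path matches the `/replica.my.cnf` reported in the log, which is why the suspicion later falls on the user/home lookup rather than on NFS itself.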
[22:31:27] idk what that looks for in PHP
[22:31:38] but we didn't touch any of that
[22:34:10] yuvipanda: seems like that expression evaluates fine when I run it over php -a in webservice shell
[22:34:40] interesting
[22:34:43] I'll restart
[22:34:48] Krinkle: this is 'guc' right?
[22:34:51] yes
[22:34:52] Krinkle: ah, yes try that
[22:34:54] Krinkle: as the tool user or as you?
[22:35:02] File "/usr/local/bin/webservice", line 153, in <module>
[22:35:02] tool.save_manifest()
[22:35:02] File "/usr/lib/python2.7/dist-packages/toollabs/common/tool.py", line 53, in save_manifest
[22:35:03] tilde_file_fd = os.open(tilde_file_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
[22:35:03] OSError: [Errno 30] Read-only file system: '/data/project/guc/service.manifest~'
[22:35:19] chasemp: I assume as the tool, I run become guc first on tools-login
[22:35:40] it's trying to write a backup file before it opens the existing?
[22:35:44] I guess restart doesn't work in readonly
[22:35:50] see the tilde on the end
[22:35:59] Yeah
[22:36:08] this is result of 'webservice restart'
[22:36:13] funky
[22:36:15] Krinkle: btw
[22:36:16] Krinkle: kubectl exec -i -t guc-1887468267-bszkx /bin/bash
[22:36:21] It's working now
[22:36:25] will give you a shell inside the same container
[22:36:29] restart came through regardless
[22:36:35] must have been a node w/o the right package yuvipanda?
[22:36:36] yuvipanda: 'webservice shell' does that, right?
[22:36:44] chasemp: hmm?
[22:36:56] Session ended, resume using 'kubectl attach interactive -c interactive -i -t' command when the pod is running
[22:36:56] Pod stopped. Session cannot be resumed.
[22:36:58] Krinkle: nope, webservice shell creates a new container with the same config, not the same container
[22:37:10] same same but different
[22:37:37] Krinkle: apparently
[22:37:43] OK, looks like other tools are also affected
[22:37:44] https://tools.wmflabs.org/wikiinfo/?wikiids=nl
[22:37:58] Krinkle: does it work after the restart?
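[Editor's note: the traceback above shows `save_manifest` creating a `~` backup file with `os.O_CREAT | os.O_EXCL`, which fails with `EROFS` (errno 30) on the read-only NFS mount. A sketch of that write pattern — the surrounding logic is assumed from the traceback, not copied from the toollabs package:]

```python
import os
import tempfile


def save_manifest(path, data):
    """Write `data` via an exclusively-created tilde backup file, then
    rename into place. The O_EXCL flag makes the create fail with EEXIST
    if the backup already exists; on a read-only filesystem the same
    os.open call fails with EROFS (errno 30), which is exactly what
    'webservice restart' hit during the NFS maintenance."""
    tilde = path + '~'
    fd = os.open(tilde, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    try:
        os.write(fd, data.encode())
    finally:
        os.close(fd)
    os.rename(tilde, path)  # atomic replace on the same filesystem


# Demonstrate the happy path in a scratch directory.
workdir = tempfile.mkdtemp()
manifest = os.path.join(workdir, 'service.manifest')
save_manifest(manifest, 'web: lighttpd\n')
print(open(manifest).read())  # → web: lighttpd
```

Notably, the restart still succeeded after the OSError: the manifest write is bookkeeping, so the pod restart went through even though the backup file could not be created.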
[22:38:04] This one is less popular, so I won't restart it so you can look at it?
[22:38:13] yuvipanda: Guc is fine now, yes.
[22:38:22] it wasn't able to write the backup file but restart worked regardless
[22:39:28] It uses the same code, https://github.com/Krinkle/toollabs-base/blob/e01c08c44ffb49f3a82650c2581dea37e2665a3c/src/GlobalConfig.php#L141-L146
[22:39:52] https://github.com/Krinkle/toollabs-base/blob/e01c08c44ffb49f3a82650c2581dea37e2665a3c/src/GlobalConfig.php#L90-L91
[22:41:18] Krinkle: I could just restart all k8s webservices again
[22:48:30] yuvipanda: interesting broadcast. Didn't know about that :)
[22:50:31] bah, IRC died for a bit
[22:50:36] Krinkle: :D 'wall'. Check out 'write' too
[22:51:02] !log tools shut down bastion 02 and 05 and make 03 root only
[22:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:43:48] yuvipanda: what are 'wall' and 'write' ?
[23:44:03] yuvipanda: wikiinfo and presumably others have not recovered yet.
[23:44:11] Maybe a restart will fix it for all.
[23:44:23] I'll leave it up to you. Maybe it'd be good to figure out why it went missing somehow though.
[23:44:33] And why just that file, when the php files seem to be working?
[23:44:45] Or rather, not the file, but the user id / context evaluating
[23:44:59] somehow posix isn't recognizing the user of those k8s contexts
[23:45:05] could be triggered by nfs outage I suppose
[23:45:10] maybe cached somewhere
[23:48:11] Krinkle: yeah, I'm unsure
[23:48:22] Krinkle: that comes from nss, which is independent of NFS
[23:51:16] Krinkle, wall and write are commands that allow you to send messages to other users' terminals
[23:51:45] wall goes to all users, write goes to a given user (and optionally tty)
[23:52:17] Krenair: are we through with all the locks and accounts ?
[23:52:50] -> pm
[23:57:22] So... when will tools be available again?
[23:58:02] we are working on it so as soon as possible :)
[23:58:10] Okay, cool.
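[Editor's note: the remark that user resolution "comes from nss, which is independent of NFS" can be made concrete. User lookups go through the passwd database (served by NSS), so a uid with no passwd entry inside a container simply fails to resolve, regardless of whether NFS is up. A small Python sketch; the unmapped uid value is a made-up assumption for illustration:]

```python
import os
import pwd


def lookup_user(uid):
    """Resolve a uid through the passwd database (backed by NSS).
    Returns (username, home directory), or None if NSS has no entry
    for the uid -- the situation suspected for the tool uids inside
    the affected k8s containers."""
    try:
        entry = pwd.getpwuid(uid)
        return entry.pw_name, entry.pw_dir
    except KeyError:
        return None


# The current process uid should normally resolve...
print(lookup_user(os.getuid()))
# ...but an unmapped uid does not (this huge uid is assumed to be absent):
print(lookup_user(2**31 - 2))  # → None
```

Anything that derives a home directory from such a failed lookup ends up with nothing, which ties this back to the `/replica.my.cnf` symptom earlier in the log.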
[23:58:32] I just need to kick-start a web service so