[08:19:11] PROBLEM - SSH on tools-exec-1221 is CRITICAL: Server answer
[08:39:11] RECOVERY - SSH on tools-exec-1221 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0)
[08:45:12] PROBLEM - SSH on tools-exec-1221 is CRITICAL: Server answer
[09:05:11] RECOVERY - SSH on tools-exec-1221 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0)
[09:42:19] Labs, DBA, Horizon: Tgr unable to login on Horizon - https://phabricator.wikimedia.org/T131630#2338309 (jcrespo) Open>Resolved a:jcrespo I will close this for now, the title task (Tgr unable to login on Horizon) is resolved.
[11:18:16] Labs, Tool-Labs, Community-Tech-Tool-Labs, Developer-Relations, Documentation: Create a "my first Python webservice" tutorial for Tool Labs - https://phabricator.wikimedia.org/T134494#2338409 (Qgil)
[11:25:26] Labs, Labs-Infrastructure, DBA: Queries of commonswiki_p.filearchive for fa_sha1 are slow - https://phabricator.wikimedia.org/T71088#726770 (Volans) The query is not using any index: ``` MariaDB LABS localhost (none) > explain SELECT * FROM commonswiki_p.filearchive WHERE fa_sha1 = '0mpoldytyxspxrdbf...
[11:26:39] RECOVERY - Puppet staleness on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[11:42:14] Labs, Labs-Infrastructure, DBA: Queries of commonswiki_p.filearchive for fa_sha1 are slow - https://phabricator.wikimedia.org/T71088#726770 (valhallasw) One option might be to create a filearchive_notdeleted view (analogous to the _userindex one), with a `WHERE fa_deleted&1 = 0`. (https://git.wikime...
[11:53:09] !log tools cherry-pick https://gerrit.wikimedia.org/r/#/c/280652 https://gerrit.wikimedia.org/r/#/c/290479 https://gerrit.wikimedia.org/r/#/c/291710/ on tools-puppetmaster-01
[11:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[11:55:25] Labs, Labs-Infrastructure, DBA: Queries of commonswiki_p.filearchive for fa_sha1 are slow - https://phabricator.wikimedia.org/T71088#2338487 (jcrespo)
[11:59:48] PROBLEM - Puppet run on tools-prometheus-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[12:09:42] RECOVERY - Puppet run on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:52:12] PROBLEM - SSH on tools-exec-1221 is CRITICAL: Server answer
[12:59:25] Labs, Tool-Labs: puppet disabled on tools-pastion-01 - https://phabricator.wikimedia.org/T136552#2338643 (valhallasw)
[12:59:53] Labs, Tool-Labs: ssh on tools-exec-1221 closes connection - https://phabricator.wikimedia.org/T136553#2338657 (valhallasw)
[13:05:21] Labs, Tool-Labs: ssh on tools-exec-1221 closes connection - https://phabricator.wikimedia.org/T136553#2338682 (valhallasw) ``` valhallasw@tools-bastion-02:~$ qmod -d "*@tools-exec-1221" valhallasw@tools-bastion-02.tools.eqiad.wmflabs changed state of "continuous@tools-exec-1221.tools.eqiad.wmflabs" (disa...
[13:06:31] !log tools rebooting tools-exec-1221
[13:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[13:07:24] Labs, Tool-Labs: ssh on tools-exec-1221 closes connection - https://phabricator.wikimedia.org/T136553#2338685 (Ladsgroup) Was it too resource consuming?
[13:12:11] RECOVERY - SSH on tools-exec-1221 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0)
[13:13:24] Labs, Tool-Labs: ssh on tools-exec-1221 closes connection - https://phabricator.wikimedia.org/T136553#2338701 (valhallasw) No, the host was hanging, and thus had to be rebooted.
The jobs mentioned above were running there, could not be restarted, and thus had to be force-deleted (otherwise SGE would have...
[13:14:24] Labs, Tool-Labs: ssh on tools-exec-1221 closes connection - https://phabricator.wikimedia.org/T136553#2338702 (valhallasw) Open>Resolved a:valhallasw Host is back online after a reboot, and the queues are re-enabled.
[13:15:32] PROBLEM - Puppet staleness on tools-exec-1221 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [43200.0]
[13:17:33] Labs, Tool-Labs: Stale NFS handle breaks puppet on tools-exec-1204, -1205 and -1218 - https://phabricator.wikimedia.org/T136495#2338707 (valhallasw) This is now also happening on tools-exec-1203: ``` Error: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/File[/public/dumps]: Could not evalu...
[13:20:34] RECOVERY - Puppet staleness on tools-exec-1221 is OK: OK: Less than 1.00% above the threshold [3600.0]
[13:59:34] Labs, Tool-Labs: puppet disabled on tools-prometheus-01 - https://phabricator.wikimedia.org/T136498#2338794 (fgiunchedi) Open>Resolved a:fgiunchedi indeed, I've reenabled puppet and cherry-picked https://gerrit.wikimedia.org/r/#/c/291710/ on tools-puppetmaster-01 so that's now running the sam...
[14:01:47] YuviPanda: FYI the prometheus instance on tools now has a /tools/ prefix, IOW https://tools-prometheus.wmflabs.org/tools/status from https://gerrit.wikimedia.org/r/#/c/290479/
[14:02:17] godog: thanks!
[14:02:52] valhallasw`cloud: np, trying to wrap up a few things before going on vacation tomorrow :D
[16:58:36] Labs-project-extdist, MediaWiki-extensions-ExtensionDistributor: Download snapshot generates 404 for downloads - https://phabricator.wikimedia.org/T136564#2339112 (Legoktm) p:Triage>Unbreak!
a:Legoktm
[18:06:57] (PS1) Jean-Frédéric: Migrate to use Intuition as a library [labs/tools/heritage] - https://gerrit.wikimedia.org/r/291776 (https://phabricator.wikimedia.org/T134565)
[18:15:41] (CR) Krinkle: Migrate to use Intuition as a library (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/291776 (https://phabricator.wikimedia.org/T134565) (owner: Jean-Frédéric)
[19:09:35] So... where do I connect if I want to be able to access all of the replicas on tool labs? Is it c1.labsdb? Or did I read the docs wrong?
[19:13:39] (CR) Jean-Frédéric: Migrate to use Intuition as a library (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/291776 (https://phabricator.wikimedia.org/T134565) (owner: Jean-Frédéric)
[19:14:54] (PS4) Jean-Frédéric: Add local dev environment with docker-compose [labs/tools/heritage] - https://gerrit.wikimedia.org/r/291198 (https://phabricator.wikimedia.org/T136351)
[19:15:20] (PS5) Jean-Frédéric: Add local dev environment with docker-compose [labs/tools/heritage] - https://gerrit.wikimedia.org/r/291198 (https://phabricator.wikimedia.org/T136351)
[19:42:53] hi, why does the recentchanges table of ptwiki_p have data from before the last 30 days? MariaDB [ptwiki_p]> SELECT MIN(rc_timestamp) FROM recentchanges; returns 20160207180814, but should return something like 20160430...
[19:42:58] Labs, Tool-Labs, xTools-on-Labs: xtools-articleinfo spawning a large number of duplicate webservices - https://phabricator.wikimedia.org/T132471#2339578 (Matthewrbowker) a:Matthewrbowker
[19:43:37] this is resulting in a bug in a graph that uses this table: http://tools.wmflabs.org/ptwikis/Patrulhamento_de_IPs
[19:48:32] Labs, Tool-Labs, xTools-on-Labs: xtools-articleinfo spawning a large number of duplicate webservices - https://phabricator.wikimedia.org/T132471#2339610 (Matthewrbowker) p:Triage>Normal
[19:49:01] Labs, Tool-Labs: Investigate why Joe is default editor on toollabs - https://phabricator.wikimedia.org/T100526#2339611 (valhallasw) declined>Open Reopening this. For some reason, `git` starts `joe` (as `valhallasw`) even though my ``` valhallasw@tools-bastion-03:~/src/pywikibot-core$ cat ~/.se...
[20:02:24] Labs, DBA, Horizon: Tgr unable to login on Horizon - https://phabricator.wikimedia.org/T131630#2339667 (Andrew) Thank you @jynus
[20:09:39] Matthew_: around?
[20:09:50] Yes
[20:10:15] Matthew_: do you want me to try moving xtools-articleinfo to kubernetes? that might fix this problem and also provide you with more isolation
[20:10:25] it doesn't change anything for you
[20:10:28] it still runs off of NFS
[20:10:35] and you can change/deploy code as you used to
[20:10:43] I only have PHP5.6 available now tho
[20:10:44] If you think it will help.
[20:11:00] Matthew_: so the question becomes, do you think it'll work on php5.6
[20:11:05] I don't know if our current code is php5.6 compatible though.
[20:11:19] Matthew_: does it run on precise or trusty?
[20:11:19] May I take a look and get back to you?
[20:11:33] Matthew_: it runs on trusty, so should be ok.
[20:11:35] Matthew_: sure
[20:11:46] Okay. I'll look and let you know. Thank you!
[20:12:19] Matthew_: yw. it could also allow you to have a http based health check, so you don't need to do webservice restart
[20:12:44] Okay.
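Editor's note on the "http based health check" mentioned at 20:12:19: the log does not say how such a check would be configured on the kubernetes backend, so the following is only a minimal sketch of the general idea. The function name `is_healthy`, the timeout value, and the rule "any 2xx answer counts as healthy" are illustrative assumptions, not something taken from the conversation or from Tool Labs.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Hypothetical probe: True if a GET on url answers 2xx within timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False
```

A supervisor could call a probe like this periodically and restart the webservice only after several consecutive failures, which is the manual `webservice restart` step the message refers to.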
[20:18:49] Labs, Tool-Labs, xTools-on-Labs: xtools-articleinfo spawning a large number of duplicate webservices - https://phabricator.wikimedia.org/T132471#2339690 (MusikAnimal) Since adding Yuvipanda's magic script I haven't noticed any extraneous webservices. I think this can safely be closed as resolved.
[20:20:39] Labs, Tool-Labs, xTools-on-Labs: xtools-articleinfo spawning a large number of duplicate webservices - https://phabricator.wikimedia.org/T132471#2339692 (Matthewrbowker) Open>Resolved >>! In T132471#2339690, @MusikAnimal wrote: > Since adding Yuvipanda's magic script I haven't noticed any ext...
[20:24:26] YuviPanda: hey, it would be great (and if you have some time) to take a look at this
[20:24:27] https://gerrit.wikimedia.org/r/#/c/291751/
[20:25:46] thanks
[21:09:52] Quarry: Add date when query was last run - https://phabricator.wikimedia.org/T77941#832144 (agray) This would be very useful for reports which use Quarry data (allowing us to timestamp the source for the end-user). The page currently reports the ID of the last run (in the source as `"qrun_id": 12345`) which...
[21:19:55] Labs, Tool-Labs: jsub appears to act differently towards network requests - https://phabricator.wikimedia.org/T136588#2339811 (Ladsgroup)
[21:30:32] Betacommand: ping
[21:30:47] Betacommand: Do you know the status of https://tools.wmflabs.org/?tool=wikiviewstats / https://tools.wmflabs.org/wikiviewstats it seems to be down
[21:31:00] Is this obsoleted by https://tools.wmflabs.org/pageviews/ ?
[22:00:57] Labs, Tool-Labs: jsub appears to act differently towards network requests - https://phabricator.wikimedia.org/T136588#2339939 (Yamaha5) I test it with core. it shows this error! ``` WARNING: Waiting 10 seconds before retrying. ERROR: Traceback (most recent call last): File "/data/project/checkdictatio...
[22:04:18] valhallasw`cloud: Can you update https://github.com/valhallasw/tsreports to have a url set in the url field on top? e.g.
to https://tools.wmflabs.org/tsreports/ and also in the readme
[22:04:28] Should we create a redirect from toolserver.org~/reports ?
[22:10:34] Krinkle: did you ever get a chance to look at nagf?
[22:14:00] Not yet
[22:14:25] YuviPanda: Tell me :)
[22:14:40] tools-login, become nagf, qstat
[22:14:49] Krinkle: no qstat
[22:14:51] Krinkle: kubectl get pods
[22:14:56] no service.manifest
[22:15:07] Krinkle: and if you change public_html/ it'll be instantly reflected
[22:15:26] Krinkle: yeah, it's the testbed for the new k8s backend. You can see the yaml file it is running in nagf-deployment.yaml
[22:16:21] YuviPanda: rc or deployment?
[22:16:28] Krinkle: deployment
[22:17:38] Krinkle: your logs are also back on access.log and error.log
[22:22:02] YuviPanda: Interesting
[22:22:05] So it mounts from NFS?
[22:22:28] and then uses lighttpd to read public_html and write to logs
[22:22:58] Krinkle: yup
[22:23:04] Krinkle: it's the exact same code + config we run on gridengine
[22:23:09] Yeah
[22:23:24] Krinkle: and I'm currently working on adding a --backend=k8s option to webservice
[22:23:30] Krinkle: so it'll just submit jobs to k8s instead of gridengine
[22:23:33] and nothing else changes
[22:23:36] So I assume once this is stable, it might replace that? (with the deployment yaml being implied, rather than explicit for each)
[22:23:43] Krinkle: yup
[22:24:04] Krinkle: there are still cases when webservice will want to run under gridengine (primarily, if they are spawning jobs on the grid themselves)
[22:24:11] Krinkle: other than that, it's all positive.
[22:24:14] YuviPanda: Do you intend to have a relatively simple way to get most of this infra but without NFS requirement?
[22:24:32] Krinkle: I want to, but I haven't been able to think of a way to do that that is actually simple
[22:24:52] Yeah, you need a way to bundle the code and ship it
[22:24:57] Krinkle: yup
[22:25:26] Krinkle: there is https://phabricator.wikimedia.org/T136264 for evaluating a proper PaaS for tools
[22:25:30] Krinkle: which will definitely be NFS Free
[22:25:49] Krinkle: https://phabricator.wikimedia.org/T136265 solicits comments on what the evaluation criteria should be
[22:25:50] Having to bootstrap it from a public git repo isn't practical. And of course one would ideally still have easy access to logs and errors (and persist between restarts, and shared when having multiple replicas)
[22:26:20] Krinkle: yeah, so most of these PaaS things have all of that covered.
[22:26:21] I guess k8s would allow one pod to persist as local volume (still not NFS)
[22:26:34] that's kind of a losing proposition though, since the node could go away
[22:26:36] Separate from the actual http pod
[22:26:50] and then ssh into that via k8s to view the logs
[22:26:52] the actual solution to that is to deploy actual persistent storage (Cinder / Ceph)
[22:26:56] Right
[22:27:00] there's a separate ticket for log storage as well
[22:27:02] logstash :)
[22:27:10] where the actual solution is ElasticSearch + something
[22:27:16] yeah
[22:27:17] Though difficult with access controls.
[22:27:20] yup
[22:27:23] :)
[22:27:30] ElasticSearch's access control plugin is proprietary open core stuff
[22:27:39] log storage itself is a good 6 month project
[22:28:09] YuviPanda: Something like syslogd to NFS could work maybe
[22:28:11] Krinkle: so we made an early decision to get rid of gridengine first, and then slowly get rid of NFS.
[22:28:18] Krinkle: kubernetes has 'log collectors' as a concept
[22:28:20] asynchronous and no hard dependency
[22:28:26] so those would work
[22:28:39] yeah, so that might be an intermediate next step
[22:28:39] Yeah
[22:28:53] I guess that's what k8s log collectors could be effectively
[22:29:05] UDP or TCP to a subscriber which then persists separately
[22:29:06] anyhow
[22:29:09] g2h
[22:29:11] g2g
[22:29:14] Krinkle: kk! cya
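
Editor's note on the closing idea ("UDP or TCP to a subscriber which then persists separately", "asynchronous and no hard dependency"): a rough sketch of that pattern over UDP follows. It is not how kubernetes log collectors are implemented and nothing in the log specifies a protocol or file layout; the names `send_log` and `run_subscriber`, the one-datagram-per-line format, and the `max_messages` stop condition are all assumptions made for illustration.

```python
import socket


def send_log(line: str, host: str, port: int) -> None:
    """Hypothetical sender: one log line per UDP datagram, no reply awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(line.encode("utf-8"), (host, port))


def run_subscriber(sock: socket.socket, path: str, max_messages: int) -> int:
    """Hypothetical subscriber: append received datagrams to path, one per line.

    Stops after max_messages so the sketch terminates; a real collector
    would loop until shut down.
    """
    written = 0
    with open(path, "ab") as out:
        while written < max_messages:
            data, _addr = sock.recvfrom(65535)
            out.write(data.rstrip(b"\n") + b"\n")
            out.flush()
            written += 1
    return written
```

Because the sender never waits for an acknowledgement, the web pod does not depend on the subscriber being up, which matches the "no hard dependency" point; the trade-off is that datagrams sent while the subscriber is down are simply lost.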