[00:06:37] !log striker Update to 82eb1c3 (T144710)
[00:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Striker/SAL
[00:06:42] T144710: Create Wikitech/LDAP accounts via a new user friendly guided workflow - https://phabricator.wikimedia.org/T144710
[00:27:36] Labs-project-Wikistats: allthetropes is not updating on wikistats - https://phabricator.wikimedia.org/T146712#2856064 (NDKilla) @Dzahn Shouldn't be an issue but I copy/pasted MySQL output to be sure I didn't typo the DB name before saying it's deleted. Please only purge links associated with unknown databas...
[00:58:46] Labs-project-Wikistats: allthetropes is not updating on wikistats - https://phabricator.wikimedia.org/T146712#2856123 (Dzahn) >>! In T146712#2856064, @NDKilla wrote > Below are wiki databases that still exist but whose stats haven't been updated in over 400 hours > pnpwiki I took one random example and it doe...
[01:00:41] Labs-project-Wikistats: allthetropes is not updating on wikistats - https://phabricator.wikimedia.org/T146712#2856124 (Dzahn) In case you say to remove the wiki suffix, pnp is 404 as well. https://pnp.miraheze.org/wiki/ Maybe just check those URLs in a browser?
[01:02:55] Labs-project-Wikistats: all kinds of mixed issues with miraheze table (was: allthetropes is not updating on wikistats) - https://phabricator.wikimedia.org/T146712#2856138 (Dzahn)
[01:15:02] the file listing at http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/ doesn't do the right thing I think with content-disposition, i.e. should display inline but downloads instead
[01:15:25] what would be the task to file or the config to change?
[01:17:17] godog: works for me? I think that is just plain old apache directory listings
[01:17:57] godog: but a bug report would go in https://phabricator.wikimedia.org/project/view/668/
[01:19:06] those logs really could/should be organized by year or year/month too
[01:19:29] just so the dir listing doesn't take forever to render
[01:19:39] bd808: thanks for the pointer!
[01:20:29] works for you on chrome? but yeah ff shows inline
[01:25:49] godog: I don't use Google's corporate spyware ;)
[01:26:31] dammit you didn't fall for it!
[01:26:54] though chromium would do too
[01:33:58] godog: works for me in chrome 54.0.2840.9 on osx
[01:34:09] Hi.
[01:34:15] https://en.wikipedia.org/wiki/Special:Contributions/10.68.17.202
[01:34:26] Is Labs using special IP addresses? MediaWiki isn't blocking these?
[01:35:14] yeah, those are edits from tool labs I think... let me reverse the ip
[01:35:39] tools-exec-1401.tools.eqiad.wmflabs
[01:36:23] bd808: odd. thanks
[01:36:50] Yvette: someone's bot editing while logged out I guess
[01:37:16] I mean, it's my bot for some of them.
[01:37:30] looks like it's pretty common -- https://en.wikipedia.org/w/index.php?title=Wikipedia:Biographies_of_living_persons/Noticeboard/Watchlist&action=history
[01:37:30] Just curious that it's using that IP and that MediaWiki is allowing the edits.
[01:37:55] You'd think a special range like that would be banned or something.
[01:38:06] the 10/8 range is used for internal networking at WMF
[01:38:25] I guess Labs --> production traffic is on the internal network?
[01:38:39] sort of
[01:38:44] I kind of assumed it went outside.
[01:39:19] it's not in the same network segment
[01:39:43] When I saw the 10. address, I thought it might be some kind of proxy issue.
[01:39:46] Like https://en.wikipedia.org/wiki/Special:Contributions/127.0.0.1
[01:39:48] But shrug.
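As a quick illustration of the point about the 10/8 range: the addresses discussed here (10.68.17.202 and the 127.0.0.1 example) are not publicly routable, which Python's standard `ipaddress` module can confirm. This is only a sketch; the helper name `describe_ip` is made up for illustration and is not part of any Labs or MediaWiki tooling.

```python
import ipaddress

# The block that the log says is used for internal networking at WMF.
INTERNAL_10_8 = ipaddress.ip_network("10.0.0.0/8")


def describe_ip(addr: str) -> str:
    """Classify an address as loopback, inside 10/8, or other (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:
        return "loopback"
    if ip in INTERNAL_10_8:
        return "internal 10/8"
    return "other"


# Addresses taken from this log; the last one is the public address quoted for californium.
for addr in ("10.68.17.202", "127.0.0.1", "208.80.154.147"):
    print(addr, "->", describe_ip(addr))
```

A check like this only says that an address is private; it does not explain how the traffic was routed, which is what the rest of the discussion covers.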
[01:39:58] but the x-forwarded-for headers that are set by the external varnish servers see the labs 10.68.x.x ip
[01:40:01] Not sure why the script isn't logging in, but I'll wait a bit to see if it keeps happening.
[01:41:59] if you pass ?assert=user or ?assert=bot with your request then it will be blocked by the action api if your login failed
[01:42:11] which is the safe thing to do honestly
[01:43:34] I think if we looked at the full X-forwarded-for headers on those requests we would see a public ip that belongs to the WMF. I've never fully traced it out though
[01:43:57] Yeah, I've considered adding an assert.
[01:44:14] Personally I'd rather have the edit "unattributed" than not have the edit, though.
[01:44:23] :) yeah
[01:44:49] as long as it doesn't raise some admin's ire to see ips making bot edits
[01:45:03] Hah, yeah.
[01:50:17] sometimes labs instances get privileged access to services on 'production' hosts that are there for labs-support reasons
[01:50:39] sometimes you expect them to get the same unprivileged access that the public internet gets
[01:51:39] right now, no NAT is done to translate labs private IPs into public IPs before they reach the public LVS machines
[01:52:04] so varnish sees the private IP and sticks that in the XFF
[02:05:12] AndyRussG, okay, from reading your ssh -vvv log and from that, this issue is way before you hit deployment-prep
[02:05:13] you can't even get into labs
[02:05:20] Krenair: yeah! huh I changed Gerrit key like 6 months ago, I was assuming that it's somehow coordinated with labs
[02:05:26] Krenair: sorry I didn't see what u meant by -labs
[02:05:27] <AndyRussG> But https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack shows the old key, the one you pasted
[02:05:27] <AndyRussG> I should just update it there, no?
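Going back to the `?assert=user` / `?assert=bot` suggestion at 01:41:59: a minimal sketch of how a bot might attach that parameter to an edit request and recognise the resulting failure. The helper names, the example title and token, and the choice of endpoint are assumptions for illustration; the error codes checked are the ones the MediaWiki action API reports for a failed assertion, and no request is actually sent here.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint, for illustration only


def build_edit_params(title, text, token, assert_as="user"):
    """Build action API edit parameters that include a login assertion (hypothetical helper)."""
    if assert_as not in ("user", "bot"):
        raise ValueError("assert_as must be 'user' or 'bot'")
    return {
        "action": "edit",
        "format": "json",
        "title": title,
        "text": text,
        "token": token,
        # With this set, the API rejects the request instead of saving a logged-out edit.
        "assert": assert_as,
    }


def login_assertion_failed(response: dict) -> bool:
    """Return True if a decoded JSON response reports a failed login assertion."""
    code = response.get("error", {}).get("code", "")
    return code in ("assertuserfailed", "assertbotfailed")


params = build_edit_params("Project:Sandbox", "test edit", "placeholder-token", assert_as="bot")
print(API_URL + "?" + urlencode(params))
```

This is the trade-off raised in the log: with the assertion the edit is lost when the login silently fails, without it the edit goes through attributed to the Labs IP.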
[02:05:36] you need to update it there as well as gerrit, yes [02:05:55] gerrit's ssh key list controls SSH to gerrit.wikimedia.org:29418 [02:06:02] it is not linked to labs LDAP [02:06:02] * AndyRussG self trout-slaps repeatedly [02:06:20] there is a historical request laying around to link it that way somewhere [02:06:49] K lemme fix dat.... [02:09:21] K I assume it updates on a schedule... [02:10:06] I can't find the task I mentioned [02:10:13] krenair@bastion-01:~$ /usr/sbin/ssh-key-ldap-lookup andyrussg [02:10:13] ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDAMmAGeTcMpB7ZNlnWlqN16H0H8c37/XkhgVJBrQAngXfNLBP6aji0ldq0wrUmJFKPmu/BXftUr0fX02Rohz87qG1po242IXovdbjhPSGCi3sa4ofqL5smUuOKBJ/hkKGqxKjsAQ2sWktI6onHcX7MrTwUHxACal2c3LF6D+mKr0VpL0kXfnzCJvn2QW4Abjgc68623lb7Z1vR8tV3peepex/nrZ2hbSBBzuH+ImcEGm3Pm9/BRxcqIy6Z6OMpkmc+WAtqEmlekZNOhoa71DrN3yt0xxp7jZ3dUQ89ll49JqUHlJ7o35+V5YQCPK0vbT6aDAx3B7lnb8zg8f5cfFwj5MU27gkHFcaKMKBsZ3EYcRy+RseLT5H0SbcHtvGOoo3My7fy88V8KmwyR4W2CV+1ETXUdx [02:10:14] x+oA62mgtrcvfACZXqXG3FX3ZeaHviG+VzXZkWSaxFJ4oXSFR2QvFUQXvcG2ck0kysLFYMSoi02FbZBLcLP3IefsFMgtGjNICiWe6XcGeE5p6vPJ/kbIzwmexQM8DlNH1sGIKuWUSeuXCdeLGZaV99v5UVM6iDhLQQJzQwOBOQpT4v82uco/nZS/6gIOOrDcwaSmy4UDZ3bmZhL64kRteqau9PUWo+tzGgMKmOdMqAlUqY1FryXHrkdP0PcmoYJnQ8DbfyAh20aI1uUw== andrew.green.df@gmail.com [02:10:21] you should be able to ssh in now, AndyRussG [02:15:36] Krenair: yep all good! [02:15:46] Krenair: thx so much and sorry for the bother!!!!!! 
:D \o/
[02:42:01] Tool-Labs-tools-Pageviews: More user-friendly errors for when there is no data - https://phabricator.wikimedia.org/T152657#2856296 (MusikAnimal)
[02:48:13] Tool-Labs-tools-Pageviews: Vibrating page on large screens - https://phabricator.wikimedia.org/T152658#2856313 (MusikAnimal)
[02:48:30] Tool-Labs-tools-Pageviews: Vibrating page on large screens - https://phabricator.wikimedia.org/T152658#2856325 (MusikAnimal) p:Triage>High
[03:01:16] (CR) Catrope: [V: 0 C: 2] Don't show approvals as 0 with new Gerrit version [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/325839 (owner: Alex Monk)
[03:01:21] (CR) jenkins-bot: [V: 0 C: 0] Don't show approvals as 0 with new Gerrit version [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/325839 (owner: Alex Monk)
[03:07:56] (Merged) jenkins-bot: Don't show approvals as 0 with new Gerrit version [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/325839 (owner: Alex Monk)
[03:25:58] Labs, Tool-Labs: Warnings/errors in /var/lib/gridengine/spool/qmaster/messages - https://phabricator.wikimedia.org/T152477#2856381 (scfc) T151980 changed `host_aliases`, but the grid master was probably not restarted afterwards, so it was still working with a reference to that host, therefore I decided t...
[03:44:14] Labs, Tool-Labs: Warnings/errors in /var/lib/gridengine/spool/qmaster/messages - https://phabricator.wikimedia.org/T152477#2856386 (scfc) Rebooting `tools-webgrid-lighttpd-1208` was not enough: I had to remove the directory `/var/spool/gridengine/execd/tools-webgrid-lighttpd-1208/active_jobs/4594249.1`....
[04:13:36] Labs, Tool-Labs: Redis replication from tools-proxy-01 to tools-proxy-02 broken - https://phabricator.wikimedia.org/T152356#2856393 (scfc) (That was meant to say https://gerrit.wikimedia.org/r/#/c/325751/.)
[05:11:37] seems like beta-cluster does not have latest code - https://gerrit.wikimedia.org/r/#/c/325732/ is missing
[05:21:26] yurik: it looks like the update job is running successfully -- https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/
[05:21:55] have you ssh'ed into deployment-tin and looked at the clone to see if the problem is obvious?
[05:22:07] bd808, i'm looking at deployment-tin:/srv/mediawiki/php-master/extensions/JsonConfig/includes$ vi JCSingleton.php
[05:22:14] it's outdated
[05:22:21] ok.
[05:22:35] I wonder if the extensions mega repo is messed up?
[05:22:42] * bd808 logs in to poke around
[05:23:12] thx :)
[05:38:18] yurik: I think that whatever magic updates the mediawiki/extensions.git repo is busted
[05:38:28] yepii
[05:38:33] it's not showing any changes since 11 hours ago
[05:38:34] https://github.com/wikimedia/mediawiki-extensions/commits/master
[05:38:36] * yurik loves magic
[05:38:55] which ... would probably be around the time that the gerrit upgrade started
[05:39:18] and considering that gerrit was busted for a few hours...
[05:39:19] so I guess file a bug and hope that hashar fixes it
[05:41:12] https://phabricator.wikimedia.org/T152663
[05:41:15] thx!
[07:03:03] (PS1) BryanDavis: contrib: Add ldapPublicKey when creating dummy users [labs/striker] - https://gerrit.wikimedia.org/r/325890
[07:12:01] (PS3) BryanDavis: Bump static, striker, and wheels submodules [labs/striker/deploy] - https://gerrit.wikimedia.org/r/325814 (https://phabricator.wikimedia.org/T144710)
[07:38:14] Striker: Striker error logs not getting into ELK cluster - https://phabricator.wikimedia.org/T151422#2856532 (bd808) The ferm rules allow `$DOMAIN_NETWORKS`. Californium is `208.80.154.147/26` which must at least be a part of `$::network::constants::deployable_networks` or scap3 wouldn't work. I guess I need...
[07:45:56] Striker: Striker error logs not getting into ELK cluster - https://phabricator.wikimedia.org/T151422#2816717 (yuvipanda) I verified that the packets are making it all the way to logstash1003, and being dropped by iptables. here's the expanded ruleset: ``` ACCEPT udp -- 10.128.0.0/24 anywhere...
[07:57:21] Striker: Striker error logs not getting into ELK cluster - https://phabricator.wikimedia.org/T151422#2856575 (yuvipanda) We realized that californium is hitting logstash1003 over ipv6... ``` yuvipanda@logstash1003:~$ sudo ip6tables -L | grep 11514 ACCEPT udp 2620:0:860:101::/64 anywhere...
[08:29:31] Striker: Striker error logs not getting into ELK cluster - https://phabricator.wikimedia.org/T151422#2856583 (yuvipanda) So I added a TRACE but looks like there's no traffic from californium to logstash1003 in general?
[09:07:44] [gerrit-patch-uploader] eloquence opened pull request #33: Clarify process for patch set updates (master...patch-1) https://git.io/v1g2i
[11:30:48] is there a way to move .out and .err to a subdir such as /logs? -e -o works, but it is changing the .out/.err to something random.
[11:31:23] -e $HOME/logs isn't working as described in the docs
[11:35:11] Krenair maybe you know? (seen you contributed to the config git repo)
[14:51:41] Labs-project-Wikistats: all kinds of mixed issues with miraheze table (was: allthetropes is not updating on wikistats) - https://phabricator.wikimedia.org/T146712#2857128 (NDKilla) @Dzahn all urls are suffixed with wiki (mediawiki grants are to like '%wik%'.* or something). The databases in the second list...
[15:42:17] Striker: Striker error logs not getting into ELK cluster - https://phabricator.wikimedia.org/T151422#2857200 (bd808) >>! In T151422#2856583, @yuvipanda wrote: > So I added a TRACE but looks like there's no traffic from californium to logstash1003 in general? The logging for Striker would be the only direct...
[17:02:32] [gerrit-patch-uploader] valhallasw pushed 2 new commits to master: https://git.io/v12aV
[17:02:32] gerrit-patch-uploader/master dc89db6 Erik Moeller: Clarify process for patch set updates
[17:02:33] gerrit-patch-uploader/master 94ab68f Merlijn van Deen: Merge pull request #33 from eloquence/patch-1...
[17:04:07] [gerrit-patch-uploader] valhallasw pushed 2 new commits to master: https://git.io/v12ad
[17:04:07] gerrit-patch-uploader/master d4f15f9 Merlijn van Deen: Only add non-space characters to URL
[17:04:08] gerrit-patch-uploader/master 5797b2d Merlijn van Deen: Merge branch 'master' of https://github.com/valhallasw/gerrit-patch-uploader
[17:14:50] Striker: Striker error logs not getting into ELK cluster - https://phabricator.wikimedia.org/T151422#2857363 (bd808) @yuvipanda helped me debug this further by temporarily disabling Puppet on californium and increasing the log verbosity of Striker by editing /etc/striker/striker.ini. I then tailed /srv/log/s...
[17:19:41] Striker: Some Striker errors not getting into ELK cluster - https://phabricator.wikimedia.org/T151422#2857371 (bd808)
[17:28:28] Striker: Some Striker errors not getting into ELK cluster - https://phabricator.wikimedia.org/T151422#2857389 (bd808) >>! In T151422#2857363, @bd808 wrote: > The `telnet` in the initial report is bogus because this is a UDP event stream not a TCP stream. But I started testing because I had a user report of a...
[17:35:28] Labs, Tool-Labs: ssh from tools-puppetmaster-02 and tools-bastion-03 to tools-services-01 times out - https://phabricator.wikimedia.org/T152695#2857414 (scfc)
[17:36:19] Labs, Tool-Labs: ssh from tools-puppetmaster-02 and tools-bastion-03 to tools-services-01 times out - https://phabricator.wikimedia.org/T152695#2857433 (scfc)
[17:54:42] (PS1) BryanDavis: profile: make less output nicer [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/325972
[17:54:44] (PS1) BryanDavis: stashbot.sh: Add 'attach' command [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/325973
[18:22:45] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 2.95 ms
[18:29:08] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[18:38:45] Labs, Labs-Infrastructure, Monitoring: nova: Monitor existence and membership for certain projects and accounts - https://phabricator.wikimedia.org/T152708#2857711 (Andrew)
[18:48:32] !log tools restarted toolschecker on tools-checker-01
[18:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:56:40] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms
[18:58:45] andrewbogott: how do I cleanup these super annoying ldap entries ^
[18:59:04] PROBLEM - Puppet run on tools-grid-master is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:59:27] YuviPanda: are they ldap host entries? Or something else?
[18:59:35] I assumed they were part of shinken's state storage
[18:59:37] andrewbogott: pretty sure LDAP that gets picked up by shinken
[18:59:47] * andrewbogott checks
[18:59:47] andrewbogott: yeah that gets re-run every 10min
[19:00:59] * Krenair facepalms
[19:01:01] guys
[19:01:04] shinkengen gets data from ldap
[19:01:25] I have complained about stale ldap host data many times before
[19:03:26] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 1.62 ms
[19:03:49] ok, I cleaned up a bunch of things, including ^
[19:04:05] Shinkengen will need a rewrite anyway, since ldap hosts are no longer especially accurate (obviously)
[19:04:33] fwiw what I did for prometheus is ask the openstack api for a list of hosts, seems to work well
[19:04:36] /etc/shinken/generated/tools.cfg: host_name secgroup-lag-102
[19:04:37] /etc/shinken/generated/tools.cfg- address 10.68.17.218
[19:04:53] krenair@shinken-01:~$ ldapsearch -x aRecord=10.68.17.218 dn -LLL
[19:04:53] dn: dc=ci-jessie-wikimedia-49929.contintcloud.eqiad.wmflabs,ou=hosts,dc=wikime
[19:04:53] dia,dc=org
[19:04:54] dn: dc=ci-trusty-wikimedia-151608.contintcloud.eqiad.wmflabs,ou=hosts,dc=wikim
[19:04:54] edia,dc=org
[19:04:55] dn: dc=ci-jessie-wikimedia-229502.contintcloud.eqiad.wmflabs,ou=hosts,dc=wikim
[19:04:56] edia,dc=org
[19:05:03] 3 contintcloud hosts with the same aRecord? yeah. . .
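The stale-LDAP problem above (three contintcloud host entries sharing one aRecord) is the kind of thing a small consistency check can flag. A minimal sketch, assuming the host entries have already been fetched as (dn, aRecord) pairs; the function name and the input shape are assumptions for illustration, not how shinkengen actually works, and the last entry is a made-up DN added only to show a non-duplicated address.

```python
from collections import defaultdict


def find_shared_addresses(entries):
    """Return {address: [dn, ...]} for addresses claimed by more than one host entry."""
    by_addr = defaultdict(list)
    for dn, a_record in entries:
        by_addr[a_record].append(dn)
    return {addr: dns for addr, dns in by_addr.items() if len(dns) > 1}


SUFFIX = ",ou=hosts,dc=wikimedia,dc=org"
entries = [
    # The three DNs from the ldapsearch output in the log, with wrapped lines rejoined.
    ("dc=ci-jessie-wikimedia-49929.contintcloud.eqiad.wmflabs" + SUFFIX, "10.68.17.218"),
    ("dc=ci-trusty-wikimedia-151608.contintcloud.eqiad.wmflabs" + SUFFIX, "10.68.17.218"),
    ("dc=ci-jessie-wikimedia-229502.contintcloud.eqiad.wmflabs" + SUFFIX, "10.68.17.218"),
    # Hypothetical entry, not from the log.
    ("dc=example-host.eqiad.wmflabs" + SUFFIX, "10.68.21.22"),
]

for addr, dns in find_shared_addresses(entries).items():
    print(addr, "is claimed by", len(dns), "entries")
```

Asking the OpenStack API for the list of hosts, as mentioned at 19:04:33, avoids the problem at the source; a check like this only helps spot stale data that is already there.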
[19:05:59] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[19:06:39] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[19:23:14] Labs, Labs-Infrastructure, Operations, netops, and 3 others: Provide read-only access to OpenStack APIs from WMF IP space - https://phabricator.wikimedia.org/T150092#2857881 (Andrew)
[19:29:06] RECOVERY - Puppet run on tools-grid-master is OK: OK: Less than 1.00% above the threshold [0.0]
[19:39:46] I'd need to upgrade prometheus-node-exporter in labs too to the latest version, IIRC there's no salt for that and the closest is clustershell?
[19:40:06] godog: tools or labs? :)
[19:42:45] YuviPanda: labs in this case
[19:42:56] I can start with tools tho
[19:43:31] godog: so there's salt from labcontrol1001 but that won't hit deployment-prep (or other places with their own saltmaster)
[19:43:31] godog: will hit tools tho
[19:43:39] bd808: Just got a whole bunch of cron daemon error emails from labs, for multiple projects. For example: "error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error)."
[19:44:00] which is greek to me
[19:45:06] YuviPanda: ah, that'd do too, thanks! what about places with their own salt master?
[19:45:11] actually those were all about an hour ago, but just got them
[19:45:20] or rather just checked my email :)
[19:45:22] godog: then need to find their salt master and use it.
[19:46:23] also "error: commlib error: got read error (closing "tools-grid-master.tools.eqiad.wmflabs/qmaster/1")"
[19:47:25] YuviPanda: thanks, any easy way to find all salt masters in this case?
[19:48:08] godog: hmm, not sure. we used to have 'watroles' that helped you find instances with a role but that doesn't work right now
[19:48:22] godog: I know that at least deployment-prep and integration have one. so I guess maybe just do those now and see how it goes?
[19:50:35] YuviPanda: yeah I'll do that now, the next change is a commandline one in puppet that will break on the old version :(
[19:50:40] kaldari: very odd -- sounds like a network connectivity issue
[19:50:53] godog: ah, ouch
[19:51:26] valhallasw`cloud: I'm just going to ignore them for now, but wanted to let you know just in case :)
[20:07:43] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:23:16] Striker: Some Striker errors not getting into ELK cluster - https://phabricator.wikimedia.org/T151422#2858048 (bd808) The Python stack traces are too long to encode in a UDP packet. ``` 20:22:04.269268 IP (tos 0x0, ttl 63, id 11981, offset 0, flags [+], proto UDP (17), length 1500) californium.wikimedia....
[20:27:11] Striker: Striker error log events not getting into ELK cluster due to UDP truncation of JSON payload - https://phabricator.wikimedia.org/T151422#2858064 (bd808)
[20:33:39] Labs, Labs-Infrastructure, Monitoring, Patch-For-Review, Wikimedia-Incident: labservices1001 crashed and sent no pages - https://phabricator.wikimedia.org/T152368#2858069 (Andrew) Open>Resolved
[20:49:44] Labs, Tool-Labs, Community-Tech-Tool-Labs, Developer-Relations, and 2 others: Developing community norms for vital bots and tools - https://phabricator.wikimedia.org/T149312#2858100 (bd808) In a full 90 minute slot, @chasemp and I would expand the scope of this to cover: * Planning for the Tool L...
[21:02:42] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:44:54] Labs, Labs-Infrastructure, Monitoring: nova: Monitor existence and membership for certain projects and accounts - https://phabricator.wikimedia.org/T152708#2857711 (Krenair) These are keystone things rather than nova. As I mentioned earlier I also think we should have monitoring of the $project.wmfla...
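The UDP truncation diagnosed in T151422 comes down to payload size: a JSON log event carrying a long Python stack trace can exceed what fits in one unfragmented datagram. A rough sketch of a pre-send size check, assuming a 1500-byte MTU (the packet length shown in the tcpdump excerpt) with 20 bytes of IPv4 header and 8 bytes of UDP header; the function name and the event fields are hypothetical, and this is not Striker's actual logging code.

```python
import json

MTU = 1500          # assumed link MTU, matching "length 1500" in the tcpdump excerpt
IPV4_HEADER = 20    # IPv4 header without options
UDP_HEADER = 8
MAX_UNFRAGMENTED_PAYLOAD = MTU - IPV4_HEADER - UDP_HEADER


def fits_in_one_datagram(event: dict) -> bool:
    """Return True if the JSON-encoded event fits in a single unfragmented UDP datagram."""
    payload = json.dumps(event).encode("utf-8")
    return len(payload) <= MAX_UNFRAGMENTED_PAYLOAD


# Hypothetical events: a short message, and one with a long fake stack trace.
short_event = {"level": "ERROR", "message": "login failed"}
long_event = {
    "level": "ERROR",
    "message": "unhandled exception",
    "stack": "  File ...\n" * 200,
}

print("short event fits:", fits_in_one_datagram(short_event))
print("long event fits:", fits_in_one_datagram(long_event))
```

A sender that detects an oversized event has a few options, such as trimming the stack trace or switching that log stream to a TCP transport; which one suits Striker is outside what this log settles.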
[22:25:19] Tool-Labs-tools-Other, Possible-Tech-Projects: Fix TreeViews to provide pageviews statistics for all articles of any wikiproject etc. - https://phabricator.wikimedia.org/T56184#2858577 (Nuria)