[00:06:56] PROBLEM - Check the last execution of refinery-sqoop-mediawiki-private on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-mediawiki-private [01:20:40] (03PS1) 10Milimetric: Fix sqoop of cu_changes [analytics/refinery] - 10https://gerrit.wikimedia.org/r/520152 [01:20:54] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "hotfix to get cu_changes sqoop to work" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/520152 (owner: 10Milimetric) [01:40:26] !log deployed refinery, restarted refinery-mediawiki-sqoop-private [01:40:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [05:26:51] 10Analytics, 10DBA: hi.wikisource added to labs replicas? - https://phabricator.wikimedia.org/T227030 (10Marostegui) As @Reedy points out, hi.wiksource isn't created yet, not even its database {T219374}. As the wiki is marked as a public wiki, the process is as follows: - Database created - DBAs take over {T2... [05:27:46] 10Analytics, 10DBA, 10Data-Services: Prepare and check storage layer for hi.wikisource - https://phabricator.wikimedia.org/T219374 (10Marostegui) Adding #analytics as they are interested in knowing when this wiki finally gets created so they can sqoop data from it {T227030} [05:28:06] 10Analytics, 10DBA: hi.wikisource added to labs replicas? 
- https://phabricator.wikimedia.org/T227030 (10Marostegui) 05Open→03Declined [07:08:08] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Allow all Analytics tools to work with Kerberos auth - https://phabricator.wikimedia.org/T226698 (10elukey) [07:08:36] 10Analytics, 10Analytics-Kanban, 10Operations, 10vm-requests, and 2 others: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) 05Open→03Resolved [08:07:03] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on an-tool1006 is CRITICAL: CRITICAL [08:08:05] this is me testing --^ [08:49:38] 10Analytics, 10Operations, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10elukey) [08:52:21] 10Analytics, 10Operations, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10elukey) [10:10:44] !log reset-failed refinery-sqoop-mediawiki-private.service [10:11:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:20:24] RECOVERY - Check the last execution of refinery-sqoop-mediawiki-private on an-coord1001 is OK: OK: Status of the systemd unit refinery-sqoop-mediawiki-private [10:32:45] 10Analytics, 10User-Elukey: Show IPs matching a list of IP subnets in Webrequest data - https://phabricator.wikimedia.org/T220639 (10elukey) >>! In T220639#5292325, @faidon wrote: > - For RPKI specifically we would also like to differentiate between three states: no match, match but with no alternative prefix... [10:41:24] 10Analytics: User knissen can't access Superset - https://phabricator.wikimedia.org/T226431 (10kai.nissen) Thanks, Nuria! I can now login and see a selection of dashboards. I cannot access them, though. I'm starting to think that there might be a general issue with my account, since I cannot even login to gerrit... 
[11:01:02] * elukey lunch! [12:49:11] 10Analytics, 10Operations, 10Wikimedia-Logstash, 10observability: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) [12:56:16] 10Analytics, 10Operations, 10Traffic: Size of headers processed by varnish? - https://phabricator.wikimedia.org/T198152 (10ema) >>! In T198152#4324161, @ema wrote: > Both [[ https://varnish-cache.org/docs/5.1/reference/varnishd.html#http-req-hdr-len | varnish ]] and [[http://nginx.org/en/docs/http/ngx_http_c... [13:07:11] 10Analytics: User knissen can't access Superset - https://phabricator.wikimedia.org/T226431 (10elukey) @kai.nissen Hi! If you are on freenode do you mind to join #wikimedia-analytics? It would be easier to debug the issue while live chatting :) [13:18:50] elukey: i'm here now :) [13:19:39] (regarding the superset permissions) [13:19:45] 10Analytics, 10Operations, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10Ottomata) +1 I think this alarm should alert SRE. [13:20:11] Kai_WMDE: o/ [13:20:27] thanks for looking into it [13:20:54] so you were saying that you can pass the user/pass ldap login, but then when you get to superset you get some requests denied? [13:21:12] yes [13:21:37] can you retry now so I can check logs? [13:23:01] alright, should show in the logs now [13:24:52] Kai_WMDE: I can see 401s from httpd (that is in front of superset) [13:25:18] yes, that what my browser's network log shows, too [13:25:34] but nothing indicating why [13:25:36] weird [13:26:13] ah wait need to check one thing [13:27:10] Kai_WMDE: mmm no in theory the uid is knissen, and you seem to log in with the correct user [13:27:33] and you are in the nda ldap group [13:28:59] Kai_WMDE: are you able to see https://superset.wikimedia.org/dashboard/list/ ? 
[13:29:24] yes [13:29:24] because I noticed that you are trying to access a wmde dashboard, with your username on it.. So I guess you created it? [13:29:40] what about any other dashboard? [13:29:47] trying to understand what is breaking [13:30:03] yes, I created that one [13:30:22] and I tried to access other dashboards with no luck [13:33:14] this time, if I send my credentials again when trying to view a dashboard, it shows some diagrams [13:38:13] but I keep getting authentication prompts, 401s and the error message "There was an issue fetching the favorite status of this dashboard." [13:39:15] Kai_WMDE: are you using Chrome? If so can you try with a different browser? [13:39:38] yes, I'm on Chromium. but I tried Firefox already [13:40:00] because we had an issue reported with Chrome recently, this is why I am asking [13:40:15] but if it keeps happening with Firefox it is not related [13:40:22] when did you get your LDAP credentials? [13:40:23] recently? [13:40:39] because from what I can see it seems that httpd fails to authenticate you for some reason [13:41:41] the account was created many years ago [13:42:21] and I was added to "nda" around two years ago [13:42:41] super strange [13:43:20] can you retry now? I am trying to set debug logging for httpd [13:43:24] yes. and as I wrote in my last comment on the ticket, i found out yesterday that I can't even login to gerrit anymore. [13:44:23] I just retried [13:45:16] another time if you have patience please [13:46:29] done [13:47:27] 10Analytics: [BUG] Logging error of MobileWikiAppDailyStats for the iOS app - https://phabricator.wikimedia.org/T226219 (10Ottomata) > when we refine we get the schema for the 1st record we find and we assume always backwards compatibility of schemas. For EventLogging Hive, we actually use the latest schema. Th... 
[13:50:05] Kai_WMDE: I found some error that is totally cryptic to me, but it seems from LDAP [13:51:16] sounds like a hot lead :) [13:54:43] 10Analytics: Make JSONSchema aware Refine merge in existing Hive schema to read data - https://phabricator.wikimedia.org/T227088 (10Ottomata) [13:55:11] 10Analytics: User knissen can't access Superset - https://phabricator.wikimedia.org/T226431 (10elukey) Enabled httpd log to trace8 for mod_proxy and proxy_authnz_ldap, and got this: ` [Tue Jul 02 13:46:08.785077 2019] [authnz_ldap:debug] [pid 31706:tid 140223660926720] mod_authnz_ldap.c(523): [client 10.64.0.13... [13:58:42] not really, it seems that it tests your uid for group 'wmf' (in which you are not in) and then it gives up [14:06:12] there is at least one WMDE person that has working access to superset. so it /should/ work for people outside the "wmf" group. [14:07:07] Kai_WMDE: yep yep it should check the 'nda' group, the one in which you are in [14:07:14] but for some reason it doesn't [14:13:59] elukey: maybe add Operations to that ticket? [14:15:23] yeah I am going to ping a couple of people to get more info [14:15:39] 10Analytics, 10Operations, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) Hm, I just noticed there are more eventstreams processors than I had thought.... [14:15:55] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb1003 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://scb1003.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:16:15] ok, thanks. I'll be afk for half an hour. 
[14:17:46] 10Analytics: User knissen can't access Superset - https://phabricator.wikimedia.org/T226431 (10elukey) @jbond @MoritzMuehlenhoff I'd need some advice in here since I am a bit ignorant about LDAP. As far as I can see from the logs, it seems that the `nda` group membership check is not performed for the user kniss... [14:19:22] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://scb1001.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:19:31] ottomata: --^ [14:20:06] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb2006 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://scb2006.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:22:28] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb2003 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://scb2003.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:24:42] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb2001 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://scb2001.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:24:44] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb2002 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://scb2002.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:25:30] PROBLEM - Check if active EventStreams endpoint is delivering messages. 
on scb1002 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://scb1002.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:28:02] Kai_WMDE: can you try to log in http://grafana.wikimedia.org when you have time? [14:28:34] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb2004 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://scb2004.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:28:50] ahhhh [14:28:59] the checker is missing the X-Client-IP header [14:31:24] sending a patch [14:32:37] jAH sorry elukey [14:32:39] good catch [14:32:53] it won't always be that [14:32:54] ottomata: https://gerrit.wikimedia.org/r/#/c/520247/ +1? [14:32:59] let me ammend. [14:34:01] ah you want to get it from the url right [14:34:08] I'll leave it to you then [14:34:18] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb2005 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://scb2005.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:35:17] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/520247/3/modules/profile/files/eventstreams/check_eventstreams.sh [14:35:33] oh boy max time arg off [14:37:09] elukey: do you know if superset has worked for anyone who is not wmf? [14:37:16] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb1004 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://scb1004.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:37:33] elukey: merged, thank you [14:39:33] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform: Stream Configuration Service - https://phabricator.wikimedia.org/T205319 (10Ottomata) > - As a product manager/analyst/engineer, I want to set the privacy whitelist s... [14:40:21] jbond42: yeah I think so, but we use the same config in other places (for httpd I mean) [14:45:58] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb1003 is OK: OK: An EventStreams message was consumed from http://scb1003.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:47:27] 10Analytics: Jan Dittrich would like to have access to superset - https://phabricator.wikimedia.org/T227093 (10Jan_Dittrich) [14:47:40] brb [14:49:39] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb1001 is OK: OK: An EventStreams message was consumed from http://scb1001.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:50:11] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb2006 is OK: OK: An EventStreams message was consumed from http://scb2006.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:52:31] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb2003 is OK: OK: An EventStreams message was consumed from http://scb2003.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:54:58] RECOVERY - Check if active EventStreams endpoint is delivering messages. 
on scb2001 is OK: OK: An EventStreams message was consumed from http://scb2001.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:54:58] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb2002 is OK: OK: An EventStreams message was consumed from http://scb2002.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:55:48] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb1002 is OK: OK: An EventStreams message was consumed from http://scb1002.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:58:34] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb2004 is OK: OK: An EventStreams message was consumed from http://scb2004.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [15:04:28] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb2005 is OK: OK: An EventStreams message was consumed from http://scb2005.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [15:07:32] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb1004 is OK: OK: An EventStreams message was consumed from http://scb1004.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [15:24:32] 10Analytics, 10Analytics-Kanban, 10Cleanup, 10Operations, 10Patch-For-Review: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10Krinkle) On the GitHub mirror, I've closed outstanding pull requests on the mirror, and set the the "Archived" (read-only) flag on. This means it is st... 
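[editor's note] The EventStreams check discussed at 14:28 to 14:35 (consume one message from the stream within 10 seconds, sending an X-Client-IP header) could be sketched roughly as below. This is not the contents of check_eventstreams.sh, which the log does not quote. The function names, the header value and the timeout handling are assumptions made for illustration.

```python
import time
import urllib.request


def first_sse_data(lines):
    """Return the payload of the first 'data:' line in an iterable of text lines, or None."""
    for line in lines:
        if line.startswith("data:"):
            return line[len("data:"):].strip()
    return None


def check_stream(url, client_ip, timeout=10.0):
    """Return True if at least one SSE data line arrives before the deadline.

    client_ip is a placeholder value for the X-Client-IP header; the log does
    not say what value the real checker sends.
    """
    req = urllib.request.Request(url, headers={"X-Client-IP": client_ip})
    deadline = time.monotonic() + timeout
    try:
        # The socket timeout bounds each blocking read, not the whole exchange,
        # so the loop below also checks an overall deadline.
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            def decoded():
                for raw in resp:
                    if time.monotonic() > deadline:
                        return
                    yield raw.decode("utf-8", "replace")
            return first_sse_data(decoded()) is not None
    except OSError:
        # Covers connection errors, HTTP errors and read timeouts.
        return False
```

A caller would map True/False onto the OK/CRITICAL states seen in the alerts above; that mapping is left out here.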
[15:24:52] 10Analytics, 10Analytics-Kanban, 10Cleanup, 10Operations, 10Patch-For-Review: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10Krinkle) [15:28:15] elukey: logging into grafana works [15:32:28] 10Analytics, 10Operations, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) Ah still wrong. Full details. The varnish limit is 25 per varnish-backend inst... [15:32:40] nuria: ottomata in every limn-* repo there is a bunch of directories called "dashboards", "datasources", etc, which haven't been touched in years [15:32:51] it's the only thing remaining when removing the query directories [15:33:11] can we assume those are out of date and clean up the repo? [15:35:00] fdans: do you have retro in your cal? [15:35:15] nope [15:35:43] elukey: it wouldn't be now because nuria is in the managers meeting [15:37:48] jbond42: I know that "lea-wmde" can access superset. she's not in "wmf". [15:39:06] Kai_WMDE: ack thanks, elukey i thought i had responded. anyway, right now i can't see anything wrong with your config. looking at something else right now but will take another look in a sec. fyi when i logged in i got an http 500 but that may be expected and is unrelated [15:41:55] fdans: I think that the calendar is messed up, me dan and Andrew had the invite [15:42:47] jbond42: ah yes if you don't have your account manually added first then you'll get a 500, I fixed it with upstream but they still have to release the new code :( [15:43:25] ack [16:00:36] elukey: if i look at the logs it looks like ldap auth is working. you can ignore the message about wmf, there are later ones which mention the nda group. 
[16:01:06] if i run the following command i also see that it appears that the downstream application is the one sending the 401 and only for some endpoints [16:01:10] awk '$(NF-2)=="knissen" {print $1,$3,$4,$7}' /var/log/apache2/superset.wikimedia.org-access.log [16:01:31] further `grep knissen /var/log/superset/syslog.log [16:01:41] ` also shows some errors relating to the user [16:02:42] ping fdans [16:02:55] fdans: standup [16:08:51] 10Analytics, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) @Cmjohnson bump! [16:09:13] 10Analytics, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) a:03Cmjohnson Feel free to reassign [16:09:30] 10Analytics, 10Analytics-Kanban: Replace analytics mailto link in analytics.wikimedia.org - https://phabricator.wikimedia.org/T215362 (10Nuria) 05Open→03Resolved [16:11:45] jbond42: thanks a lot for checking [16:14:33] Kai_WMDE: do you have cookies enabled in your browser? [16:14:44] yes [16:14:55] 10Analytics: User knissen can't access Superset - https://phabricator.wikimedia.org/T226431 (10jbond) i looked into this a bit and i *think* the debug messages which mention `cn=wmf,ou=groups,dc=wikimedia,dc=org` can be ignored as we see the following in the logs as well ` [Tue Jul 02 13:46:09.168629 2019] [aut... [16:15:13] elukey: np i have added what i found to the ticket hope it's helpful [16:22:31] Kai_WMDE: and have you tried cleaning up cookies and starting an incognito window? [16:24:06] 10Analytics, 10Product-Analytics: Bug: Superset asking for my credentials on every page load - https://phabricator.wikimedia.org/T224159 (10JKatzWMF) @Nuria I'm on Mac OS 10.4 (Kate's out this week). 
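[editor's note] The access-log filter quoted at 16:01:10 selects lines by comparing the third-from-last field with the user name and prints fields 1, 3, 4 and 7. A minimal Python sketch of the same idea follows. The field positions are taken from the awk one-liner. The actual layout of the access log is not shown in the log, so treat the positions and the function name as assumptions.

```python
def filter_access_log(lines, user):
    """Yield fields 1, 3, 4 and 7 of each line whose third-from-last field equals user."""
    for line in lines:
        fields = line.split()
        # Skip lines too short to have a field 7 (awk would print empty strings instead).
        if len(fields) < 7:
            continue
        # Comparison, not assignment: the awk equivalent is $(NF-2) == "knissen".
        if fields[-3] == user:
            yield fields[0], fields[2], fields[3], fields[6]
```

In awk, a single `=` in the pattern assigns to the field, and the non-empty result is always true, so every line would be printed. The comparison needs `==`.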
[16:26:35] nuria: yes, both [16:27:54] T224159 seems to describe the same problem [16:27:55] T224159: Bug: Superset asking for my credentials on every page load - https://phabricator.wikimedia.org/T224159 [16:29:24] Kai_WMDE: that is what I was referring to before, but they 1) are able to log in 2) have only Chrome behaving in that way [16:30:15] right, it says so in the comments. [16:58:25] * elukey off! [17:57:45] 10Analytics, 10Operations, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Nuria) Per our conversation at standup we should probably have limits per host, does the... [18:02:07] 10Analytics: Move Eventstreams to kubernetes deployment pipeline - https://phabricator.wikimedia.org/T227122 (10Nuria) [18:11:29] 10Analytics, 10Analytics-Wikimetrics, 10Cleanup, 10GitHub-Mirrors, and 2 others: Archive analytics-wikimetrics (deprecated by Event Metrics) - https://phabricator.wikimedia.org/T219334 (10MarcoAurelio) @mforns Awesome. When you're ready (a-team I mean), please let me know and also which specific wikimetric... [18:16:09] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) [18:23:31] 10Analytics, 10Analytics-Kanban, 10Operations: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (10Nuria) a:03mforns [18:23:48] 10Analytics, 10Analytics-Kanban, 10Operations: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (10Nuria) Moving to kanban to take care of this in Q1 2019 [18:42:24] milimetric: yt? [18:42:53] hey yea [18:43:12] milimetric: when you deployed and got error [18:43:20] milimetric: disk was full on the deployment node right? [18:43:30] milimetric: or in the target of the rsync? 
[18:44:57] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: New directories created under /wmf/data/event_sanitized and /wmf/data/event_sanitized are owned by yarn:analytics - https://phabricator.wikimedia.org/T225178 (10Nuria) 05Open→03Resolved [18:45:22] 10Analytics, 10Product-Analytics: Standardize datetimes/timestamps in the Data Lake - https://phabricator.wikimedia.org/T212529 (10Nuria) [18:45:24] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Add UTC 'Z' suffix to webrequest `dt` field. - https://phabricator.wikimedia.org/T217040 (10Nuria) 05Open→03Resolved [18:45:56] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban: Enable hcatalog integration for oozie - https://phabricator.wikimedia.org/T225310 (10Nuria) 05Open→03Resolved [18:46:35] maybe ottomata knows? [18:46:38] milimetric: no idea [18:46:40] doh! [18:46:42] nuria: no idea [18:46:53] ottomata: when we got scap issue yesterday [18:47:01] I didn't check, because I didn't know what the error was, and it seemed to me the deploy worked, because the file was there [18:47:10] (the wikis file in static_data) [18:47:21] milimetric: the error was no space left on device [18:47:26] so the disk couldn't have been completely full [18:47:39] doesn't make a lot of sense, how did it copy the file then [18:47:55] milimetric: cause the disk full might be on the source not the destination [18:48:15] milimetric: it is probably copied to the cache once deployment is done [18:48:40] milimetric: cache is local to deployment1001 but as far as i can see there is tons of space in that machine [18:48:53] https://www.irccloud.com/pastebin/gnuRPrDw/ [18:49:01] maybe ottomata freed space? 
[18:49:30] nuria: well, no, before he did anything, I checked the destination, an-coord1001.eqiad.wmnet and it had the latest wikis file in /srv/deployment/analytics/refinery [18:49:46] that was when I got the first deploy error [18:49:52] milimetric: i see [18:49:55] so I kicked off that job, and it worked [18:50:05] and then later I deployed the python fix and again checked the deploy worked, and it did [18:50:09] even though I got another scap error [18:50:22] milimetric: error says "rsync: write failed on "/srv/deployment/analytics/refinery-cache/revs/4e9894c0db04ee39be0f094d0ef11ebbff198834/.git/fat/objects/ecfba6edeb07a17a0d1b09981aff917e068d146c": No space left on device (28)\nrsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.2]\nrsync error: error in file IO (code" [18:50:36] milimetric: "at receiver?" [18:50:40] an-coord1001 was full [18:50:58] i deleted some cached deploys on an-coord1001 [18:51:05] and then dan redeployed to it (i think) [18:51:18] the git fat part of the deploy is just an rsync [18:51:36] ottomata: i see, then the cache setting that might not be working is so in the target environment ? is that supposed to take effect at all? [18:51:37] so the rsync might mostly succeed in pulling down files, but then fail before finishing the full deploy [18:51:43] you know, it probably copies the source first, and the git fat stuff second [18:51:44] target [18:51:48] that is where it is supposed to keep them [18:51:52] so the source is up to date, and then it runs out of space [18:51:55] so the scap can rollback just by changing symlinks [18:52:04] meaning the failure is really a failure, but for my purposes the deploy worked [18:52:18] i think maybe the cache # setting is failing because some of the cached deploys failed? [18:52:19] not sure. [18:52:27] like, maybe it doesn't delete or know about the failed ones? 
[18:52:28] dunno [18:52:31] ottomata: or the cache setting only takes effect at the source [18:52:36] ottomata: not the destination [18:52:36] could be. [18:52:52] hm [18:52:54] i don't think so though [18:53:39] ya, there are no caches on the source [18:53:48] the source on deploy1001 is just the git repo [18:54:13] /srv/deployment/analytics/refinery-cache [18:54:15] only exists on targets [18:54:32] scap uses git tags on the source [18:54:42] 10Analytics: issues with artifact cache in an-coord1001 - https://phabricator.wikimedia.org/T227132 (10Nuria) [18:55:08] 10Analytics, 10Release-Engineering-Team: issues with artifact cache in an-coord1001 - https://phabricator.wikimedia.org/T227132 (10Nuria) [18:55:35] 10Analytics, 10Release-Engineering-Team: issues with artifact cache in an-coord1001 - https://phabricator.wikimedia.org/T227132 (10Ottomata) This also often affects other hosts with relatively small /srv partitions, like notebook* hosts. [18:55:54] 10Analytics, 10Release-Engineering-Team: issues with artifact cache in an-coord1001 - https://phabricator.wikimedia.org/T227132 (10Nuria) Can Release Engineering chime in as to whether scap config settings should also delete artifacts from the target of the deploy? 
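[editor's note] The cleanup ottomata describes at 18:50:58 (deleting old cached deploys from the target's refinery-cache directory) could be prepared with a dry-run listing like the one below. This is only a sketch and not scap's own pruning logic, which the log does not show. The function name, the `keep` default and the `active` parameter are hypothetical. It only reports candidates and deletes nothing.

```python
import os


def stale_revs(revs_dir, keep=3, active=None):
    """List rev directories beyond the `keep` newest, never including `active`.

    revs_dir: directory holding one subdirectory per deployed revision
              (for example .../refinery-cache/revs on a target host).
    active:   path of the revision currently in use, if known (hypothetical
              parameter; how to find it depends on the deployment layout).
    """
    entries = [e.path for e in os.scandir(revs_dir) if e.is_dir(follow_symlinks=False)]
    # Newest first, by modification time.
    entries.sort(key=os.path.getmtime, reverse=True)
    active_real = os.path.realpath(active) if active else None
    return [p for p in entries[keep:] if os.path.realpath(p) != active_real]
```

Partially written revisions left behind by a failed rsync would show up in this listing like any other directory, which matches the guess at 18:52:18 that failed deploys may not be cleaned up automatically.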
[18:56:30] 10Analytics, 10Release-Engineering-Team: issues with artifact cache in an-coord1001 - https://phabricator.wikimedia.org/T227132 (10Nuria) ping @greg [19:07:14] milimetric: let me know if think the formatting changes need more work (honestly) if not, i will merge and deploy wikistats [19:09:58] (03CR) 10Milimetric: "this hasn't fallen off my radar, just busy looking at the new snapshot data" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/519382 (https://phabricator.wikimedia.org/T220098) (owner: 10Fdans) [19:10:46] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Homepage: specify purging strategy - https://phabricator.wikimedia.org/T219252 (10nettrom_WMF) p:05Triage→03Normal a:05nettrom_WMF→03MMiller_WMF With the deployment of [[ https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/516520/... [19:13:20] nuria: I think the version you have is better than what's live right now, and I think we shouldn't spend too much more time on it now, it's a big time suck to sit there being super picky about how every little thing rounds, and ultimately the question is "do people get value out of these numbers", and we can't answer that [19:13:40] (03CR) 10Milimetric: [C: 03+2] Change number formating to show less decimal places [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/519036 (https://phabricator.wikimedia.org/T200070) (owner: 10Fdans) [19:13:49] milimetric: k, ya, i think a ui less cluttered is of value so yes +1 [19:14:12] exactly, it's a good improvement [19:14:18] milimetric: will wait couple hours to deploy cause i think you are also looking to other chnages [19:15:05] *changes [19:15:57] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Growth - https://phabricator.wikimedia.org/T226854 (10nettrom_WMF) Growth has three EventLogging schemas in use where data is retained: * HelpPanel * Homepag... 
[19:16:29] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list - https://phabricator.wikimedia.org/T220410 (10nettrom_WMF) [19:16:31] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Growth - https://phabricator.wikimedia.org/T226854 (10nettrom_WMF) 05Open→03Resolved [19:16:46] nuria: for wikistats? [19:16:59] milimetric: yes, but maybe i am wrong? [19:17:04] no, I just pinged fran to say I'm not going to look at his change for a couple of day [19:17:06] *days [19:17:13] 'cause I wanted to focus on the data [19:17:31] if I'm done before Friday I'll just deploy it, I have honorary ops week this week [19:23:52] 10Analytics, 10User-Elukey: Show IPs matching a list of IP subnets in Webrequest data - https://phabricator.wikimedia.org/T220639 (10ayounsi) https://as286.net/data/ana-invalids.txt is RPKI invalid data, crossed with the global routing table. Take for example: > 2.59.118.0/24;srcAS=60721;altpfx=NONE;iROAS=2.... [19:40:31] milimetric: sounds good! [19:55:29] (03CR) 10Nuria: Add access type and access site to mediacounts hourly dataset (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/517426 (https://phabricator.wikimedia.org/T225910) (owner: 10Fdans) [20:05:22] (03CR) 10Nuria: [C: 04-1] "I do not think this change works. 
Probably the convention should be to have daily/monthly granularity unless the config specifies otherwis" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/518716 (https://phabricator.wikimedia.org/T226397) (owner: 10Fdans) [20:09:05] milimetric: i will deploy wikistats tomorrow per our wed deployment policy then whatever has been merged, i just did one additional CR that I think needs a bit of work [20:14:32] 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, 10Product-Analytics: mediawiki_history missing page events - https://phabricator.wikimedia.org/T205594 (10Milimetric) Part of the confusion here is that Special:Log is matching the page exactly, while looking for page_title or page_title_historical... [20:20:41] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats2: Values in map view show unnecessary decimal digits - https://phabricator.wikimedia.org/T200070 (10Milimetric) (btw didn't mention this on the review but I looked at the pageviews-by-country map and it looks much nicer with the new formatting) [20:29:23] 10Analytics, 10MobileFrontend, 10Readers-Web-Backlog: Having trouble setting up MobileFrontend for development - https://phabricator.wikimedia.org/T226071 (10Milimetric) This was on a relatively new vagrant I set up like a month or two ago, and I just enabled the mobilefrontend role, vagrant provision, and f... 
[20:35:11] 10Analytics, 10Product-Analytics: Update R from 3.3.3 to 3.6.0 on stat and notebook machines - https://phabricator.wikimedia.org/T220542 (10mpopov) [20:45:26] nuria: I figured out the mystery [20:45:38] I think Joseph reran the 2019-05 snapshot with the new logic, and updated prod [20:45:45] that way everything just worked [20:46:02] makes today make a lot more sense, in more ways than just the jobs :) [20:46:16] he was like referencing the 2019-05 snapshot everywhere and I was very lost for a little bit [20:46:59] indeed, he did and noted it on the train notes, line 31 currently: v [20:47:00] https://etherpad.wikimedia.org/p/analytics-weekly-train [20:54:11] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for AHT - https://phabricator.wikimedia.org/T226853 (10nettrom_WMF) @Niharika : Does AHT have any EventLogging schemas that are whitelisted? [21:00:00] 10Analytics, 10Community-Tech, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech - https://phabricator.wikimedia.org/T226861 (10nettrom_WMF) p:05Triage→03Normal [21:00:17] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for AHT - https://phabricator.wikimedia.org/T226853 (10nettrom_WMF) p:05Triage→03Normal [21:03:22] milimetric: on meeting but can talk in 1 hr [21:19:51] oh, I'm heading out, was mostly talking to myself, but it's nice that things make sense now. Have been looking at data and it looks good so far. 
[21:23:43] Analytics, Operations, hardware-requests, netops, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (wiki_willy) @elukey - just wanted to follow up on this...@RobH will dig around for some quotes and recommendations
[21:23:47] Analytics, Operations, hardware-requests, netops, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (RobH) So these are all in warranty until 2020-05-31, so we will want to add in 10G NICs that are covered by Dell's system warranty. I...
[21:29:48] Analytics, Operations, hardware-requests, netops, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (RobH)
[21:51:15] Analytics, ExternalGuidance, Product-Analytics: [Bug] `init` and `mtinfo` event counts drop drastically since June 17 2019 - https://phabricator.wikimedia.org/T227150 (chelsyx)
[21:58:19] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform: Stream Configuration Service - https://phabricator.wikimedia.org/T205319 (Ottomata) FYI, I wanted to know more about Apache Atlas, so I set it up a standalone on st...
[22:01:49] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform: Stream Configuration Service - https://phabricator.wikimedia.org/T205319 (Ottomata) Check out: https://atlas.apache.org/1.2.0/Glossary.html and https://atlas.apach...
[22:13:26] Analytics, Analytics-Kanban, Operations, netops, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (wiki_willy) @Ottomata - can you reach out to Chris on IRC and schedule a time with him on this one? Sounds...
[22:52:35] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform: Stream Configuration Service - https://phabricator.wikimedia.org/T205319 (Nuria) nice, seems a fit for data governance (cc @chasemp ) but for stream config? How wou...
[22:54:04] milimetric: ya, saw joals note too
[22:54:11] milimetric: I LOVE we have that etherpad
[22:54:21] milimetric: now i can breath more easily you know
[22:54:39] yeah it’s great
[22:54:40] milimetric: let's you and i spend some time testing snapshot tomorrow
[22:54:59] sure, I’ve been running some rough metrics, all looks good so far
[22:55:10] but haven’t been too methodic
[23:32:53] Analytics: [BUG] Logging error of MobileWikiAppDailyStats for the iOS app - https://phabricator.wikimedia.org/T226219 (chelsyx) Thanks @Ottomata and @Nuria ! Since this schema is used by both the iOS and Android team and the two teams are using different revision (with different required fields), I will disc...