[02:18:30] 10Analytics, 10Analytics-Kanban, 10Tool-Pageviews: Fix double encoding of urls on mediarequests api - https://phabricator.wikimedia.org/T244373 (10Nuria) Nice docs, thanks for taking the time @fdans Ping to @elukey to keep in mind whether the ATS will encode urls on the same manner
[03:29:15] (03CR) 10Nuria: "Did not have time to look at this but * i think* I am following along and things make sense. Just making note this goes with this puppet " [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/586447 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[05:27:40] 10Analytics, 10Analytics-Kanban, 10Analytics-SWAP, 10Product-Analytics: pip not accessible in new SWAP virtual environments - https://phabricator.wikimedia.org/T247752 (10elukey) >>! In T247752#6086848, @Nuria wrote: > @nshahquinn-wmf was this issue resolved? Issue is still WIP, I am trying to figure out...
[09:20:07] 10Analytics: Corrupted parquet statistics when querying webrequest data via Superset/Presto - https://phabricator.wikimedia.org/T251231 (10elukey)
[09:33:40] so a query on two hours of webrequest takes less than a minute with presto :O
[09:33:55] and I forgot also to use the webrequest_source partition
[09:34:02] so I queried both text and upload
[09:34:09] (bad Luca)
[09:37:05] 10Analytics: Corrupted parquet statistics when querying webrequest data via Superset/Presto - https://phabricator.wikimedia.org/T251231 (10elukey) Deployed https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/592888/ and the issue went away.
[09:48:07] PROBLEM - Presto Server on an-presto1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[09:48:21] ahhaha lol
[09:48:25] I broke it!
[09:50:09] RECOVERY - Presto Server on an-presto1001 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[09:58:58] Apr 28 09:44:54 an-presto1001 presto-server[36857]: # There is insufficient memory for the Java Runtime Environment to continue.
[09:59:01] Apr 28 09:44:54 an-presto1001 presto-server[36857]: # Native memory allocation (mmap) failed to map 872415232 bytes for committing reserved memory.
[10:05:56] added some graphs to the dashboard
[10:25:26] * elukey lunch!
[10:40:59] milimetric: shots fired https://twitter.com/tmcw/status/1255018099343949826
[10:41:34] milimetric: this kind of goes in line with something that I was thinking yesterday...
[10:42:11] milimetric: we could maybe preload the detail chunk while the user is browsing the dashboard?
[10:42:18] like, asynchronously
[11:05:22] I think for fast connections the delay is not noticeable and for slow it’s not good to preload
[11:06:47] And I think the twit is wrong. Twittererer? Anyway, it’s obviously different if you’re looking at something pretty instead of fouc. But we’re not quite there. We should have nice placeholders instead of the gross “loading” stuff. You put placeholders in some places but we just need them everywhere
[11:52:40] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics, 10Patch-For-Review: Upgrade to Superset 0.36.0 - https://phabricator.wikimedia.org/T249495 (10elukey) Differences that I noticed when checking the staging Superset: * http://localhost:9080/superset/dashboard/externalsearch/ contains a lot of `...
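On the Presto/webrequest exchange at 09:33 above: a minimal sketch of that kind of query with the webrequest_source partition predicate included, which is the partition elukey forgot. It assumes PyHive's Presto client is available; the host, port, catalog and time range are placeholders, not the real cluster values.

```python
# Illustration only: placeholder host/port/catalog; assumes PyHive (pip install 'pyhive[presto]').
from pyhive import presto

conn = presto.connect(
    host='presto-coordinator.example.wmnet',  # placeholder coordinator hostname
    port=8281,                                # placeholder port
    catalog='hive',                           # placeholder catalog
    schema='wmf',
)

cur = conn.cursor()
# Restricting webrequest_source (and the time partitions) keeps Presto from
# scanning both the 'text' and 'upload' partitions, as noted in the log.
cur.execute("""
    SELECT uri_host, COUNT(*) AS requests
    FROM webrequest
    WHERE webrequest_source = 'text'
      AND year = 2020 AND month = 4 AND day = 28 AND hour IN (9, 10)
    GROUP BY uri_host
    ORDER BY requests DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```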
[12:22:00] 10Analytics, 10Analytics-Kanban: deploy bots changes to AQS - https://phabricator.wikimedia.org/T251169 (10JAllemandou) a:03JAllemandou
[12:24:52] 10Analytics: Create a Kerberos identity for zpapierski - https://phabricator.wikimedia.org/T251257 (10dcausse)
[12:41:20] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics, 10Patch-For-Review: Upgrade to Superset 0.36.0 - https://phabricator.wikimedia.org/T249495 (10elukey) Created https://github.com/apache/incubator-superset/pull/9671 to upstream, it should fix some weird errors returned. Will wait until upstream...
[12:41:55] another superset pull request --^
[12:42:13] we really need to think about moving to druid-sqlalchemy in Superset
[12:46:38] 10Analytics: Create a Kerberos identity for zpapierski - https://phabricator.wikimedia.org/T251257 (10elukey) 05Open→03Resolved ` elukey@krb1001:~$ sudo manage_principals.py create zpapierski --email_address=zpapierski@wikimedia.org Principal successfully created. Make sure to update data.yaml in Puppet. Suc...
[12:49:44] 10Analytics: jmads requesting Kerberos password - https://phabricator.wikimedia.org/T250560 (10elukey) ` elukey@krb1001:~$ sudo manage_principals.py create jmads --email_address=jmaddock-ctr@wikimedia.org Principal successfully created. Make sure to update data.yaml in Puppet. Successfully sent email to jmaddock...
[12:54:24] a-team: at a CityMD hoping to get tested (they have tests available). So I may miss standup if this long line doesn’t get processed quickly
[12:54:53] 10Analytics, 10Patch-For-Review: jmads requesting Kerberos password - https://phabricator.wikimedia.org/T250560 (10elukey) 05Open→03Resolved a:03elukey
[13:01:16] (03PS1) 10Joal: Update pageview to handle automated agent-type [analytics/aqs] - 10https://gerrit.wikimedia.org/r/592941 (https://phabricator.wikimedia.org/T251169)
[13:06:49] ehhlloooo elukey! :)
[13:09:35] (03PS1) 10Joal: Add automated pageview to per-article cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/592943 (https://phabricator.wikimedia.org/T251169)
[13:10:28] mforns: Hello! would you have a minute to try to debug the data-quality-hourly job with me?
[13:15:03] ottomata: hello!
[13:16:04] ottomata: one qs about the mediawiki_api_request that you re-ran - did it get the failed flag? If so I am wondering if the refine failure monitor didn't fire
[13:16:05] hello shall we do eventgate-main kafka tls?
[13:16:22] oh hm! elukey i think it must have since I reran with --ignore_failure_flag
[13:16:42] if it didn't have the failure flag it would have been re-run by the regular job
[13:16:48] i didn't look explicitly though
[13:16:54] there's another one failing right now too
[13:17:13] /wmf/data/raw/eventlogging/eventlogging_WMDEBannerEvents/hourly/2020/04/26/13
[13:17:47] yeah I meant to check those but forgot
[13:18:54] ah snap
[13:18:54] there is a
[13:18:54] Apr 28 00:02:26 an-launcher1001 refine_failed_flags_eventlogging_analytics[28145]: The following targets have the _REFINED_FAILED flag set:
[13:18:57] _REFINED_FAILED
[13:18:57] flag
[13:18:58] Apr 28 00:02:26 an-launcher1001 refine_failed_flags_eventlogging_analytics[28145]: hdfs://analytics-hadoop/wmf/data/raw/eventlogging/eventlogging_WMDEBannerEvents/hourly/2020/04/26/13 -> `event`.`WMDEBannerEvents` (year=2020,month=4,day=26,hour=13)
[13:18:58] here
[13:19:01] /wmf/data/event/WMDEBannerEvents/year=2020/month=4/day=26/hour=13]
[13:19:18] oh it found it?
[13:19:25] is it just not sending an email?
[13:19:39] well it should return 1 and raise an alarm
[13:19:41] in icinga
[13:20:21] so it is not working, great, will have to re-check
[13:20:23] sigh
[13:20:40] ottomata: is it ok if I run an errand for ~20 mins before the TLS change?
[13:20:44] sure
[13:20:49] ack, brb!
[13:30:28] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Team-Backlog: Produce an instrumentation event stream using new EPC and EventGate from client side browsers - https://phabricator.wikimedia.org/T241241 (10Ottomata) I brain bounced a tricky issue with @mforns yesterday. Camus ne...
[13:31:03] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Team-Backlog: Produce an instrumentation event stream using new EPC and EventGate from client side browsers - https://phabricator.wikimedia.org/T241241 (10Ottomata)
[13:38:41] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10CPT Initiatives (Modern Event Platform (TEC2)), 10MW-1.34-notes (1.34.0-wmf.20; 2019-08-27): Refactor EventBus mediawiki configuration - https://phabricator.wikimedia.org/T229863 (10Ottomata)
[13:40:34] ottomata: back
[13:40:48] ok great!
[13:41:10] i think we can do IRC unless something bad happens ya?
[13:41:24] codfw eventgate-main looks good. ssl and port 9093
[13:41:31] going to apply if ok with you
[13:41:40] will log in -operations
[13:43:21] ack
[13:43:30] ok
[13:45:33] it is interesting to see the little latency spikes for analytics
[13:45:34] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=41&fullscreen&orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-analytics&var-site=eqiad&var-ops_datasource=eqiad%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All
[13:46:10] oh hm and we lose the history because it doesn't have those broker names in the query anymore!
[13:46:38] are there any librdkafka logs somewhere?
[13:46:40] if you edit that graph and remove the kafka_broker label
[13:46:42] you get it back
[13:46:47] ah
[13:46:52] that is just there to allow for isolating one via template var
[13:47:04] looks like spikes were there before too, but were just a little lower
[13:48:38] the spikes happen around the hour, weird
[13:49:12] anyway, no big deal
[13:49:18] would be great if we removed them
[13:50:23] the rtt avg seems to be higher now in general
[13:50:28] but i guess that is expected
[13:51:49] ottomata: just to double check, did we add all the specific TLS settings about ciphers etc.. everywhere?
[13:53:09] ciphers in kafka?
[13:53:17] no i don't think we specified anything there
[13:53:38] oh you mean for the broker jvm settings?
[13:54:04] I mean
[13:54:05] kafka.ssl.cipher.suites=ECDHE-ECDSA-AES256-GCM-SHA384
[13:54:05] kafka.ssl.curves.list=P-256
[13:54:06] kafka.ssl.sigalgs.list=ECDSA+SHA256
[13:54:10] these are the librdkafka settings
[13:54:12] oh no
[13:54:23] i don't think those are anywhere except varnishkafka maybe?
[13:54:37] they are in all places where I enabled TLS to jumbo
[13:54:42] ah
[13:54:45] like netflow, etc..
[13:55:22] hey joal just joined, wanna debug?
[13:55:30] sure mforns
[13:55:33] batcave?
[13:55:38] omw!
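Context for the librdkafka settings pasted at 13:54: a minimal sketch of how the same properties look when handed to any librdkafka-based client, here via confluent-kafka-python. EventGate itself is a Node service and varnishkafka uses a config file with the `kafka.` prefix, so this is only an illustration; the broker address, CA path and topic are placeholders.

```python
# Illustration only: broker, CA path and topic are placeholders; the point is
# that these are plain librdkafka properties, minus the "kafka." prefix used
# in varnishkafka-style config files.
from confluent_kafka import Producer

conf = {
    'bootstrap.servers': 'kafka-jumbo1001.eqiad.wmnet:9093',        # placeholder broker
    'security.protocol': 'SSL',
    'ssl.ca.location': '/etc/ssl/certs/Puppet_Internal_CA.pem',      # placeholder CA path
    # The three settings from the log:
    'ssl.cipher.suites': 'ECDHE-ECDSA-AES256-GCM-SHA384',
    'ssl.curves.list': 'P-256',
    'ssl.sigalgs.list': 'ECDSA+SHA256',
}

producer = Producer(conf)
producer.produce('test.topic', value=b'hello over TLS')  # placeholder topic
producer.flush()
```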
[13:55:39] btw am a little curious about kafka-main2001 rtt
[13:55:40] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&from=1588078472843&to=1588082072843
[13:55:42] being so much higher
[13:56:00] it isn't very high, but just strange, maybe it is just doing more data
[13:56:31] yep I think so, probably it is the leader for more partitions?
[13:57:48] a few more but about as many as 2003
[13:57:55] but maybe just the active partitions
[13:58:06] this instance is only doing 2 topics
[13:58:09] yeah ok so that makes sense
[13:58:10] ok
[13:58:18] elukey: shall I proceed with eqiad?
[13:58:33] +1
[14:06:08] 10Analytics: Enable TLS encryption from Eventgate to Kafka - https://phabricator.wikimedia.org/T250149 (10Ottomata)
[14:08:42] elukey:
[14:08:43] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/592951
[14:09:10] eventgate main looks fine
[14:09:18] and surprisingly rtt didn't increase very much
[14:09:51] maybe that's because of some volume threshold thing? volume is less on codfw eventgate-main, so encryption adds more relative overhead? dunno...
[14:11:47] could be yes
[14:12:35] ottomata: sorry for the extra deploy :(
[14:15:16] weeewhaa oh well, let's do those to staging eh?!
[14:15:35] eventstreams first
[14:16:50] ack
[14:22:54] (03CR) 10Fdans: [C: 03+1] "Straightforward, lgtm!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/592943 (https://phabricator.wikimedia.org/T251169) (owner: 10Joal)
[14:31:38] (03CR) 10Fdans: [C: 03+1] "lgtm as long as we're ok with keeping the same column names in cassandra. Not a problem for me." [analytics/aqs] - 10https://gerrit.wikimedia.org/r/592941 (https://phabricator.wikimedia.org/T251169) (owner: 10Joal)
[14:31:59] eventstreams looks fine
[14:32:01] doing eventgate ones
[14:39:00] super
[14:40:22] so super cool
[14:40:26] https://github.com/druid-io/pydruid
[14:40:33] take a look at the CLI, the pandas support
[14:48:35] (03PS1) 10Joal: Add logs to data-quality RSVD-Decomposition class [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/592972
[15:08:48] (03CR) 10Milimetric: [C: 03+2] "if we do this, we should do it now. There may be some clients that break due to the enum change (old requests would now be illegal). I t" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/592941 (https://phabricator.wikimedia.org/T251169) (owner: 10Joal)
[15:09:51] (03Merged) 10jenkins-bot: Update pageview to handle automated agent-type [analytics/aqs] - 10https://gerrit.wikimedia.org/r/592941 (https://phabricator.wikimedia.org/T251169) (owner: 10Joal)
[15:10:40] (03PS2) 10Joal: Add logs to data-quality RSVD-Decomposition class [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/592972
[15:11:28] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add automated pageview to per-article cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/592943 (https://phabricator.wikimedia.org/T251169) (owner: 10Joal)
[15:14:34] sorry joal I forgot and dropped from batcave, but again there
[15:14:39] mforns: I managed to replicate the error with my code
[15:14:40] I mean I rejoined
[15:14:44] mforns: OMW !
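A minimal sketch of the pydruid pandas support elukey points at in the 14:40 messages above. The broker URL, datasource and column names are placeholders, not real cluster values.

```python
# Sketch only: broker URL, datasource and column names are placeholders.
from pydruid.client import PyDruid
from pydruid.utils.aggregators import longsum

client = PyDruid('http://druid-broker.example.org:8082', 'druid/v2')  # placeholder URL

query = client.timeseries(
    datasource='webrequest_sampled_128',              # placeholder datasource name
    granularity='hour',
    intervals='2020-04-27/2020-04-28',
    aggregations={'hits': longsum('count')},          # placeholder metric column
)

# The pandas integration mentioned in the log: results export straight into a DataFrame.
df = query.export_pandas()
print(df.head())
```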
[15:14:47] k
[15:34:44] (03CR) 10Nuria: "Let's please ad these changes to train etherpad as we will need to deploy aqs" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/592941 (https://phabricator.wikimedia.org/T251169) (owner: 10Joal)
[15:40:42] mforns: the matrix we create has the correct size, but skinny-matrix generates a random matrix of size 5x6 ???
[15:40:56] ??
[15:42:04] joal: might be that the 6 here is rsvdDimensions + rsvdOversample = 6?
[15:42:47] very possible mforns :S
[15:43:20] joal: could it be possible that until we have 144 data points, the skinny matrix will complain?
[15:43:31] because 144 will fill 6 columns?
[15:43:47] we stopped the job before 144, at 143
[15:43:54] hm
[15:44:12] mforns: how could event-navigation hourly have succeeded then?
[15:44:24] joal: can you try and run the job for 2020-04-27T22
[15:44:25] ?
[15:44:42] joal: because it has more than 144 data points since day 1
[15:45:03] sure - Have the underlying data been generated?
[15:45:43] joal: some are missing, the data for 20h and 21h
[15:45:50] for the same day are missing
[15:46:54] joal: or rerun with oversampling = 4!
[15:49:09] Will try that mforns
[15:52:03] heya addshore - When you query webrequest for wikidata stuff, can you please add "AND webrequest_source = 'text'" - It'll read a bit less data
[15:52:22] yes! sorry!
[15:52:31] Thanks addshore :)
[15:52:37] * addshore adds it
[15:52:50] mforns: works with oversampling = 4
[15:53:01] !!!
[15:57:54] So you are right, the issue comes from config - We should actually have the minimum number of values to compute be: params.seasonalityCycle * (params.rsvdDimensions + params.rsvdOversample)
[15:58:32] joal: aha makes sense!
[15:58:50] mforns: And it removes a magic number
[15:58:55] * joal happy with less magic
[15:58:56] yes!
[15:59:16] I like magic :[
[16:00:33] scala-sparkus!
[16:01:28] abra-spark-scala!
[16:02:21] xDDDDD
[16:04:13] (03PS3) 10Joal: Fix data-quality RSVD-Decomposition [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/592972 (https://phabricator.wikimedia.org/T249759)
[16:04:21] mforns: updated --^
[16:04:28] ok, looking
[16:07:09] (03PS1) 10Joal: Bump jar version of data-quality ooie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/592991 (https://phabricator.wikimedia.org/T249579)
[16:09:05] (03CR) 10Mforns: [C: 03+1] "LGTM!" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/592972 (https://phabricator.wikimedia.org/T249759) (owner: 10Joal)
[16:10:30] (03CR) 10Joal: Fix data-quality RSVD-Decomposition (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/592972 (https://phabricator.wikimedia.org/T249759) (owner: 10Joal)
[16:12:12] (03CR) 10Mforns: [C: 03+2] Fix data-quality RSVD-Decomposition [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/592972 (https://phabricator.wikimedia.org/T249759) (owner: 10Joal)
[16:12:22] joal: want me to merge?
[16:12:40] mforns: testing now, give me a minute :) - then yes please!
[16:12:47] k
[16:17:20] (03Merged) 10jenkins-bot: Fix data-quality RSVD-Decomposition [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/592972 (https://phabricator.wikimedia.org/T249759) (owner: 10Joal)
[16:20:13] ok mforns - jenkins has merged, and it's good (tests have been successful)
[16:20:25] oh... sorry for the premature +2
[16:20:46] but yea! we found and fixed it :D
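Making the fix agreed at 15:57 concrete: the job needs at least params.seasonalityCycle * (params.rsvdDimensions + params.rsvdOversample) data points. With the 24-point hourly cycle and 6 RSVD columns implied in the log, that minimum is 144, which is why the run stopped at 143 points could not build the skinny matrix. The real check lives in the Scala data-quality code in analytics/refinery/source; the Python below is only an illustration, and the dimensions/oversample split is a placeholder since the log only gives their sum.

```python
# Illustrative re-statement of the check in Python; the actual fix is in the
# Scala RSVD-Decomposition code in analytics/refinery/source.
def min_required_values(seasonality_cycle: int, rsvd_dimensions: int, rsvd_oversample: int) -> int:
    """Minimum number of data points needed before the RSVD skinny matrix can be built."""
    return seasonality_cycle * (rsvd_dimensions + rsvd_oversample)

# The log only gives the sum rsvdDimensions + rsvdOversample = 6 and a 24-point
# (hourly) cycle; the 4/2 split below is a placeholder, not the real job config.
needed = min_required_values(seasonality_cycle=24, rsvd_dimensions=4, rsvd_oversample=2)
print(needed)        # 144
print(143 >= needed)  # False: the run stopped at 143 points could not succeed
```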
[16:20:53] \o/
[16:21:10] mforns: can you please also merge the patch for jar-version bump (just above)
[16:21:35] sure
[16:26:51] (03CR) 10Mforns: [C: 04-1] Bump jar version of data-quality ooie job (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/592991 (https://phabricator.wikimedia.org/T249579) (owner: 10Joal)
[16:27:45] (03CR) 10Joal: Bump jar version of data-quality ooie job (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/592991 (https://phabricator.wikimedia.org/T249579) (owner: 10Joal)
[16:28:29] (03PS2) 10Joal: Bump jar version of data-quality ooie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/592991 (https://phabricator.wikimedia.org/T249579)
[16:29:23] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/592991 (https://phabricator.wikimedia.org/T249579) (owner: 10Joal)
[16:31:08] ping ottomata elukey milimetric
[16:31:16] oo
[16:38:39] 10Analytics: Statistics on a CN banner - https://phabricator.wikimedia.org/T251177 (10Ciell)
[17:23:25] * elukey off!
[17:31:08] a-team: does anyone have some time to help me with Cassandra testing? just need someone to setup a keyspace and ensure that data is loaded properly, not sure how long those things take
[17:51:16] 10Analytics: Statistics on a CN banner - https://phabricator.wikimedia.org/T251177 (10Jseddon) Can you provide details on the landing pages used for both years, and campaign and banner names used for both years?
[18:04:42] 10Analytics: Statistics on a CN banner - https://phabricator.wikimedia.org/T251177 (10Ciell) I did....
[18:05:37] 10Analytics: Statistics on a CN banner - https://phabricator.wikimedia.org/T251177 (10Ciell)
[18:16:42] Hi lexnasser - I can do that :)
[18:49:28] 10Analytics, 10Better Use Of Data, 10Product-Analytics (Kanban): Augment event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (10Ottomata)
[18:50:09] 10Analytics, 10Better Use Of Data, 10Product-Analytics (Kanban): Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (10Ottomata)
[18:50:29] 10Analytics, 10Better Use Of Data, 10Product-Analytics (Kanban): Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (10jlinehan)
[19:03:53] (03CR) 10Nuria: "Boy, was that 'bot' column handy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/592943 (https://phabricator.wikimedia.org/T251169) (owner: 10Joal)
[19:21:16] 10Analytics, 10Operations, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Krinkle)
[19:21:29] 10Analytics, 10Operations, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Krinkle) 05Open→03Resolved a:03Krinkle
[19:21:37] 10Analytics, 10Operations, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Krinkle) a:05Krinkle→03None
[19:39:04] 10Analytics, 10Analytics-Kanban, 10Research: Proposed adjustment to wmf.wikidata_item_page_link to better handle page moves - https://phabricator.wikimedia.org/T249773 (10Isaac) > Or ... I guess the last page move that still happened before the last import of wikidata sitelinks. @Milimetric yeah, that's rea...
[19:55:13] (03PS1) 10Ottomata: [WIP] Add python/refinery/eventstreamconfig.py and use in in bin/camus to build dynamic topic whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/593047 (https://phabricator.wikimedia.org/T241241)
[20:11:37] (03CR) 10Ottomata: "Haven't tested this with real camus yet but you get the idea..." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/593047 (https://phabricator.wikimedia.org/T241241) (owner: 10Ottomata)
[20:12:34] (03PS2) 10Ottomata: [WIP] Add python/refinery/eventstreamconfig.py and use in in bin/camus to build dynamic topic whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/593047 (https://phabricator.wikimedia.org/T241241)
[20:15:33] (03PS3) 10Ottomata: [WIP] Add python/refinery/eventstreamconfig.py and use in in bin/camus to build dynamic topic whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/593047 (https://phabricator.wikimedia.org/T241241)
[23:04:48] 10Analytics, 10Analytics-Kanban, 10Research: Proposed adjustment to wmf.wikidata_item_page_link to better handle page moves - https://phabricator.wikimedia.org/T249773 (10Nuria) >Not sure if this is something that should be fixed or if there's an easy workaround. @isaac: quoting "database" should work. Now s...
[23:21:32] mforns: if you are still around: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/591956/
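A rough sketch of the idea behind the [WIP] eventstreamconfig.py change above: ask the EventStreamConfig API for the declared streams and join their topics into a regex that Camus can use as its kafka.whitelist.topics. This is not the actual patch (it is still WIP in Gerrit); the response shape and the datacenter topic-prefix convention below are assumptions, not verified against that change.

```python
# Sketch only: the real helper is the WIP change in analytics/refinery
# (python/refinery/eventstreamconfig.py); the response shape and the
# eqiad/codfw topic-prefix convention are assumptions here.
import requests

STREAM_CONFIG_API = 'https://meta.wikimedia.org/w/api.php'

def fetch_stream_names():
    resp = requests.get(STREAM_CONFIG_API, params={
        'action': 'streamconfigs',  # action provided by the EventStreamConfig extension
        'format': 'json',
    })
    resp.raise_for_status()
    # Assumed response shape: {"streams": {"<stream name>": {...settings...}, ...}}
    return sorted(resp.json().get('streams', {}).keys())

def camus_topic_whitelist(stream_names, dc_prefixes=('eqiad', 'codfw')):
    """Build a kafka.whitelist.topics regex, assuming topics are named <dc>.<stream>."""
    topics = ['{}\\.{}'.format(dc, name.replace('.', '\\.'))
              for name in stream_names for dc in dc_prefixes]
    return '({})'.format('|'.join(topics))

if __name__ == '__main__':
    print(camus_topic_whitelist(fetch_stream_names()))
```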