[00:10:22] 10Analytics: Host API for token persistence dataset - https://phabricator.wikimedia.org/T164280 (10leila) [00:20:08] 10Analytics, 10Research-Backlog: Provide a spark job processing history and text to extract citations diffs - https://phabricator.wikimedia.org/T158896 (10leila) [00:22:00] 10Analytics, 10Product-Analytics, 10Reading-analysis, 10Research-Backlog, 10Research-consulting: Propose metrics along with qualifiers for the press kit - https://phabricator.wikimedia.org/T144639 (10leila) [00:22:23] 10Analytics, 10Product-Analytics, 10Reading-analysis, 10Research-Backlog, 10Research-consulting: Report on Wikimedia's industry ranking - https://phabricator.wikimedia.org/T141117 (10leila) [00:22:36] 10Analytics, 10Product-Analytics, 10Reading-analysis, 10Research-Backlog, 10Research-consulting: [Epic] Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221 (10leila) [00:24:34] 10Analytics, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Implement technical details and process for "datasets_p" on wikireplica hosts - https://phabricator.wikimedia.org/T173511 (10leila) [00:24:45] 10Analytics, 10Data-Services: Create a database on the wikireplica servers called "datasets_p" - https://phabricator.wikimedia.org/T173513 (10leila) [00:24:59] 10Analytics, 10Data-Services: Document the process for importing a new "datasets_p" table - https://phabricator.wikimedia.org/T173514 (10leila) [00:29:27] 10Analytics, 10Operations, 10Traffic, 10Browser-Support-Apple-Safari, and 3 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921 (10leila) [00:34:13] 10Analytics, 10Data-release, 10Privacy: An expert panel to produce recommendations on open data sharing for public good - https://phabricator.wikimedia.org/T189339 (10leila) [00:34:49] 10Analytics, 10Data-release, 10Privacy: An expert panel to produce recommendations on open data sharing for public good - 
https://phabricator.wikimedia.org/T189339 (10leila) @Nuria I've removed the Research tag but myself and others from our team are subscribed to this task. If you pick this up again and nee... [00:37:01] 10Analytics, 10Research: Check home of bmansurov - https://phabricator.wikimedia.org/T226956 (10leila) @elukey I'll remove Research from this task and myself as the assignee. Let me know if you need my help somewhere else. [00:37:10] 10Analytics: Check home of bmansurov - https://phabricator.wikimedia.org/T226956 (10leila) [00:37:20] 10Analytics: Check home of bmansurov - https://phabricator.wikimedia.org/T226956 (10leila) a:05leila→03None [00:51:29] 10Analytics, 10EventBus, 10Core Platform Team Backlog (Later), 10Services (next), 10Wikimedia-production-error: Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017 (10Pchelolo) 05Open→03Invalid We are very actively moving to eventgate, so I don't think any work here is... [01:14:25] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Requests for new JobQueue monitoring capabilities - https://phabricator.wikimedia.org/T175780 (10Pchelolo) 05Open→03Resolved I think we got all of this, except per-wiki metrics. it's tracked in T175952 so I'm gonna close this on... [01:14:29] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088 (10Pchelolo) [04:09:51] 10Analytics-Kanban, 10Product-Analytics: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (10Nuria) Proposal: - let's release data for editors with 5+ edits per country (regardless of size of bucket) per wiki, let's not release distinctively t... 
[04:50:55] 10Analytics, 10Analytics-Kanban, 10ExternalGuidance, 10Product-Analytics, 10Patch-For-Review: [Bug] `init` and `mtinfo` event counts drop drastically since June 17 2019 - https://phabricator.wikimedia.org/T227150 (10chelsyx) >>! In T227150#5317683, @dr0ptp4kt wrote: > @chelsyx I forget some of the detail... [04:53:39] 10Analytics, 10Analytics-Kanban, 10ExternalGuidance, 10Product-Analytics, 10Patch-For-Review: [Bug] `init` and `mtinfo` event counts drop drastically since June 17 2019 - https://phabricator.wikimedia.org/T227150 (10chelsyx) @Nuria Thank you for the fix! The metrics since June 17 2019 come back up, but... [05:47:53] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: Enable base::firewall on stat boxes after restricting Spark REPL ports. - https://phabricator.wikimedia.org/T170826 (10elukey) Very nice investigation, I was in fact trying to figure out the purpose of the last port and you solved it :) I... [06:22:10] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Enable base::firewall on stat boxes after restricting Spark REPL ports. - https://phabricator.wikimedia.org/T170826 (10elukey) This is a pyspark2 session opened on stat1007: ` elukey@stat1007:~$ sudo netstat -nlpt |... 
[10:28:20] ok rocm 2.5 with tensorflow-rocm 1.13.3 seems to work fine on stat1005 [10:28:25] 2.6 is broken [10:28:26] :( [10:28:44] will try to report it to upstream [10:28:47] and update documentation [10:42:03] aaand https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu [10:42:05] \o/ [10:43:30] 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created - https://phabricator.wikimedia.org/T220784 (10elukey) https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu [10:43:53] 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created - https://phabricator.wikimedia.org/T220784 (10elukey) [10:47:18] * elukey lunch! [11:53:08] 10Analytics: Jan Dittrich would like to have access to superset - https://phabricator.wikimedia.org/T227093 (10Jan_Dittrich) @Nuria I tried and got a 401 after username/password. The credentials seems to be correct though: my https://grafana.wikimedia.org/ works fine with the credentials. [11:57:35] opened https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/559 [11:58:54] 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (10elukey) All right so ROCm 2.5 and tensorflow-rocm 1.13.3 seems to work. Other versions of TF (1.13.4 and 1.14.0) lead to the follow... [12:32:21] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban: Beeline does not print full stack traces when a query fails - https://phabricator.wikimedia.org/T136858 (10elukey) Error on stat1004 was due to an experiment that I was doing to nail down why the `--verbose` option leads to: ` java.io.FileNotFoundExcepti... 
[12:38:27] 10Analytics: Jan Dittrich would like to have access to superset - https://phabricator.wikimedia.org/T227093 (10elukey) @Jan_Dittrich hi! The superset LDAP config requires the uid, that usually is not different but in your case is `wmde-jand`. I amended it now your account in superset, can you retry? [12:47:50] 10Analytics: Jan Dittrich would like to have access to superset - https://phabricator.wikimedia.org/T227093 (10Jan_Dittrich) Sadly, it does not work, still get > This server could not verify that you are authorized to access the document requested. Either you supplied the wrong credentials (e.g., bad password)... [12:50:48] 10Analytics, 10Analytics-Kanban, 10Discovery, 10Operations, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Ottomata) @EBernhardson analytics-search user should now be able to access the auth file [13:00:08] 10Analytics: Jan Dittrich would like to have access to superset - https://phabricator.wikimedia.org/T227093 (10Jan_Dittrich) @elukey I see – will you create the nda group task, or shall I do create it? [13:05:09] 10Analytics: Jan Dittrich would like to have access to superset - https://phabricator.wikimedia.org/T227093 (10elukey) >>! In T227093#5324273, @Jan_Dittrich wrote: > @elukey I see – will you create the nda group task, or shall I do create it? 
Please go ahead and create it, I'll work on it as soon as possible :) [13:08:32] (03CR) 10Ottomata: Use JsonParser to parse event data rather than YAMLParser (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521552 (https://phabricator.wikimedia.org/T227484) (owner: 10Ottomata) [13:09:32] (03CR) 10Ottomata: Use JsonParser to parse event data rather than YAMLParser (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521552 (https://phabricator.wikimedia.org/T227484) (owner: 10Ottomata) [13:33:14] 10Analytics: Jan Dittrich would like to have access to superset - https://phabricator.wikimedia.org/T227093 (10Jan_Dittrich) did so – T227774 [14:34:48] elukey: o/ see ping in -ops, meant to write here [14:35:04] spark yarn client mode job seems to be stuck on Registering block manager analytics1076.eqiad.wmnet:39211 on stat1004 and stat1007 [14:37:00] ottomata: was it an old spark session? [14:39:54] no [14:39:58] brand new [14:40:06] also, hm, you only have one port set? [14:40:08] can you tell me how to repro? [14:40:10] R_SERVICE(tcp, 13100:13100, $ANALYTICS_NETWORKS [14:40:12] ? [14:40:15] don't we need a range? [14:40:18] oooffff [14:40:31] yes for sure, that is a mistake [14:40:33] fixing [14:40:45] :) [14:41:08] I only tried to create a session and didn't execute code [14:41:25] hm in any case, hm [14:41:27] spark.driver.blockManager.port 13000 [14:41:28] but [14:41:43] oh oh [14:41:48] that was an76 trying to connect from 39211? [14:41:53] to 13000 i guess?
[14:41:58] yes yes exactly [14:42:14] failing miserably [14:42:20] ah yes Registering block manager stat1007.eqiad.wmnet:13000 [14:42:28] oh ya [14:42:46] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/522105 [14:42:46] ah i see max:max :) [14:42:59] this is what happens when you file code changes in the morning [14:43:03] haha [14:43:27] good thing we're catching this stuff on stat boxes :D [14:44:54] yep yep [14:45:00] setting ports and firewall is tricky [14:45:12] plus if I don't set the var names correctly in puppet.. [14:46:07] I need to learn how to do simple tests in spark2 [14:46:16] creating a session is not enough [14:46:49] ottomata: fixed, can you retry? [14:47:03] doing [14:50:22] looks like blockmanager works now! [14:50:23] thx! [14:50:30] \o/ [14:51:47] hmm elukey not sure if spark is working yet tho [14:52:04] things seem to be hanging, but not sure why [14:52:14] ok nm, [14:52:15] it works [14:52:19] i was just impatient :) [14:54:04] super :) [14:54:39] so with the new settings the spark driver etc.. will need to try different ports before succeeding (depending on how many people are using spark) [14:54:45] so it might take a few seconds more for that [14:55:21] the UI in yarn should work too (the app master one) [14:56:27] thanks luca! [14:59:17] thank you, sorry for the mistake [15:01:14] ottomata: if you have time, can you do on stat1005 [15:01:26] curl localhost:9100/metrics -s | grep gpu | wc -l [15:02:46] 13 [15:02:47] ! [15:03:22] there is something that I can't explain [15:03:29] you and filippo correctly get 13 [15:03:31] I get 0 [15:06:31] 13 makes sense since https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu works [15:08:19] a-team: cancelling standup in favor of staff meeting, let's get together in standup [15:08:51] roger [15:09:05] get together in standup? [15:09:14] in grooming :) [15:09:26] ah!
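The bug spotted above (`R_SERVICE(tcp, 13100:13100, ...)` with max:max) comes down to how Spark binds ports: it starts at `spark.driver.blockManager.port` and, when a port is busy, retries the next one, up to `spark.port.maxRetries` attempts (16 by default upstream), so the firewall has to open the whole range, not a single port. A minimal sketch of the arithmetic, assuming the 13000 base port from the log; the ferm-style rule string mimics the macro quoted above and is illustrative only:

```python
# Sketch of why a single-port firewall rule breaks Spark on shared stat
# boxes: Spark binds spark.driver.blockManager.port and, if busy, retries
# base+1, base+2, ... up to spark.port.maxRetries extra attempts.

def spark_port_range(base_port: int, max_retries: int = 16) -> range:
    """Inclusive range of ports Spark may try for one port setting."""
    return range(base_port, base_port + max_retries + 1)

def ferm_rule(base_port: int, max_retries: int = 16) -> str:
    """ferm-style rule covering the whole retry range, not port:port."""
    ports = spark_port_range(base_port, max_retries)
    return f"R_SERVICE(tcp, {ports[0]}:{ports[-1]}, $ANALYTICS_NETWORKS)"

# The buggy rule opened 13100:13100 only; with a block manager base port
# of 13000 and default retries, the range actually needed is wider:
print(ferm_rule(13000))  # R_SERVICE(tcp, 13000:13016, $ANALYTICS_NETWORKS)
```

This also explains the "few seconds more" remark: each busy port costs one bind attempt before the driver settles on a free one.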
K [15:10:08] 10Analytics, 10Readers-Web-Backlog, 10Mobile: EventLogging Schema errors have increased ~6x - https://phabricator.wikimedia.org/T227018 (10Ottomata) https://logstash.wikimedia.org/goto/6867e74588bcd500bf25871a62496e14 Here are the top 10 offenders: ` select schema, count(*) as cnt from eventerror where ye... [15:17:35] ottomata: found the problem! basically I had http_proxy set in my env, and it was not returning what I wanted [15:17:55] so the curl that you did was probably cached previously by the webproxy? [15:18:09] I mean, I was getting an old copy [15:18:22] while you and filippo were hitting the localhost endpoint [15:19:28] nuria: ta daaan https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu [15:19:56] elukey: po-popommmmmmm!!!! [15:23:17] (03CR) 10Awight: "There's some more simplification that can be done, probably worthwhile." (033 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/520671 (https://phabricator.wikimedia.org/T226730) (owner: 10Nuria) [15:23:27] Today I downgraded rocm to 2.5 and tensorflow to 1.13.3, seems working again [15:23:33] but I had to open a bug upstream [15:23:47] those spikes were Miriam testing :D [15:27:31] ottomata: ah no wait I get what was happening, I was curling localhost:9100 on the webproxy host! 
/o\ [15:27:38] what a disaster [15:27:44] sigh [15:27:51] * elukey cries in a corner [15:28:19] elukey: BLOGPOST [15:28:27] elukey: seriously [15:29:03] we are sort of stable now, moar testing is needed [15:29:11] upgrading to a new rocm version is a huge pain [15:29:28] I had a chat with Miriam to find a set of tests to run every time we upgrade on a canary [15:31:47] (03CR) 10Awight: [C: 04-1] Most special pages should not be pageviews [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/520671 (https://phabricator.wikimedia.org/T226730) (owner: 10Nuria) [15:36:22] (03CR) 10Awight: [C: 04-1] Most special pages should not be pageviews (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/520671 (https://phabricator.wikimedia.org/T226730) (owner: 10Nuria) [15:42:07] 10Analytics, 10Readers-Web-Backlog, 10Mobile: EventLogging Schema errors have increased ~6x - https://phabricator.wikimedia.org/T227018 (10Neil_P._Quinn_WMF) >>! In T227018#5324825, @Ottomata wrote: > I see that EditAttemptStep has some null values for fields that should be strings: > https://logstash.wikime... 
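The mystery above (`curl localhost:9100/metrics` returning 0 matches for one user and 13 for everyone else) was the `http_proxy` environment variable: with it set, "localhost" is resolved by the proxy host, i.e. the request lands on the webproxy machine's own port 9100. A simplified sketch of the usual proxy-selection logic (real clients match `no_proxy` by suffix and support wildcards; the `webproxy.example` name is made up):

```python
# Why `curl localhost:9100` can hit the wrong machine: with http_proxy set
# and localhost absent from no_proxy, the request goes to the proxy, which
# resolves "localhost" as *itself*. Simplified exact-match sketch.
from urllib.parse import urlsplit

def effective_target(url: str, env: dict) -> str:
    """Host a request actually reaches under http_proxy/no_proxy env vars."""
    host = urlsplit(url).hostname or ""
    proxy = env.get("http_proxy", "")
    no_proxy = {h.strip() for h in env.get("no_proxy", "").split(",") if h.strip()}
    if proxy and host not in no_proxy:
        return urlsplit(proxy).hostname or proxy  # request goes via the proxy
    return host

# With a proxy set, "localhost" effectively means the proxy machine:
print(effective_target("http://localhost:9100/metrics",
                       {"http_proxy": "http://webproxy.example:8080"}))
# Adding localhost to no_proxy (or `curl --noproxy '*'`) restores the intent.
```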
[15:47:12] 10Analytics, 10Datasets-Archiving, 10Research-Backlog: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10leila) [15:47:28] 10Analytics, 10Research-Backlog: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (10leila) [15:47:52] !log chown -R analytics:analytics /wmf/data/archive/geoip on HDFS [15:47:54] 10Analytics, 10Research-Backlog, 10Article-Recommendation: Make endpoint for top wikis by number of articles - https://phabricator.wikimedia.org/T220673 (10leila) [15:47:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:50:11] 10Analytics, 10Research-Backlog, 10Article-Recommendation, 10Patch-For-Review: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (10leila) [15:50:26] 10Analytics, 10Performance-Team, 10Product-Analytics, 10Research-Backlog: Switch mw.user.sessionId back to session-cookie persistence - https://phabricator.wikimedia.org/T223931 (10leila) [15:51:30] 10Analytics: Provide data dumps in the Analytics Data Lake - https://phabricator.wikimedia.org/T186559 (10diego) [15:53:15] 10Analytics, 10Operations, 10Research-Backlog, 10serviceops-radar, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10leila) [15:53:20] 10Analytics, 10Research-Backlog: Evaluate best format to release public data lake as a dump - https://phabricator.wikimedia.org/T224459 (10leila) [15:56:07] 10Analytics, 10Performance-Team, 10Research, 10Security-Team, 10WMF-Legal: A Large-scale Study of Wikipedia Users' Quality of Experience: data release - https://phabricator.wikimedia.org/T217318 (10leila) @JBennett and @JFishback_WMF can you please assign this task to someone on your end so we can make s... 
[16:04:49] 10Analytics, 10MediaWiki-General-or-Unknown, 10Research-Backlog, 10Wikidata: Improve interlingual links across wikis through Wikidata IDs - https://phabricator.wikimedia.org/T215616 (10leila) [16:05:22] 10Analytics, 10Discovery, 10Operations, 10Research-Backlog: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10leila) [16:07:47] !log sudo chown -R analytics:analytics /srv/geoip/archive/ on stat1007 [16:07:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:08:39] 10Analytics, 10Research-Backlog, 10Wikidata: Copy Wikidata dumps to HDFs - https://phabricator.wikimedia.org/T209655 (10leila) [16:09:18] 10Analytics, 10Analytics-Kanban, 10Research-Backlog: Create labeled dataset for bot identification - https://phabricator.wikimedia.org/T206267 (10leila) [16:12:06] 10Analytics, 10Analytics-EventLogging, 10Research-Backlog: 20K events by a single user in the span of 20 mins - https://phabricator.wikimedia.org/T202539 (10leila) [16:13:50] 10Analytics, 10Research-Backlog: Release edit data lake data as a public json dump /mysql dump, other? - https://phabricator.wikimedia.org/T208612 (10leila) [16:16:01] (03PS5) 10Nuria: Most special pages should not be pageviews [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/520671 (https://phabricator.wikimedia.org/T226730) [16:18:01] (03CR) 10Nuria: Most special pages should not be pageviews (033 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/520671 (https://phabricator.wikimedia.org/T226730) (owner: 10Nuria) [17:01:44] ping ottomata milimetric coming to groskin [17:01:46] ? 
[17:01:55] Omw [17:13:03] https://aceu19.apachecon.com/session/its-breeze-develop-airflow [17:13:34] milimetric: --^ [17:14:54] 10Analytics, 10Readers-Web-Backlog, 10Mobile: EventLogging Schema errors have increased ~6x - https://phabricator.wikimedia.org/T227018 (10Ottomata) > It's weird that you can't send a null value for a non-required field (and it complicates the instrumentation code a bit too). I don't think it is weird, but... [17:15:34] 10Analytics, 10Analytics-EventLogging, 10QuickSurveys, 10Readers-Web-Backlog (Tracking): QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (10Jdlrobson) [17:19:09] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 3 others: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 (10Ottomata) @DStrine, I need some help f... [17:48:15] * elukey off! [18:09:49] 10Analytics-Kanban, 10Product-Analytics: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (10Nuria) Per @ezachte's criteria of "live" wikipedias this data should not include dead/un-editable wikipedias,. Erik's word on this regard below: >>Lei... [18:38:59] nuria: I think there's a problem with the latest geoeditors snapshot [18:39:04] Hey analytics team, hope I'm not still causing problems. Added a "days in ()" to my where clause on stat1007 beeline. [18:39:16] sbassett: you are good i think, no alarms [18:39:19] my query made sense, but the data does not [18:39:26] milimetric: how so? [18:40:02] sbassett: that's ok, but ideally when looking at webrequest if you can limit the time range as much as possible, it saves a lot of resources. 
The data is HUGE [18:40:12] (not caps-yelling, it's just really big :)) [18:40:27] nuria: well, do this query: select * from geoeditors_monthly where month='2019-03' and wiki_db = 'etwiki'; [18:40:30] milimetric: yeah, just takes time :) [18:40:47] and then repeat for the latest snapshot: select * from geoeditors_monthly where month='2019-06' and wiki_db = 'etwiki'; [18:41:00] and you'll see the differences, there's basically no aggregate over 1 in the latest snapshot [18:41:03] milimetric: etwiki might not be there for both snapshots [18:41:11] milimetric: if it is a small wiki [18:41:18] milimetric: is it? [18:41:24] I looked at eswiki too [18:41:36] (and this is all activity levels, all everything [18:41:58] same for eswiki anyway, so there's something clearly wrong [18:43:08] luckily, just looks like the last snapshot, 2019-05 seems fine [18:43:16] so we still have raw data and everything, I'm investigating [18:44:39] milimetric: i see, yes, just looked and 2019-06 dat looks bad [18:45:52] nuria: I see the problem, we might have missed something in the refactor, the user identification is NULL [18:46:12] so everything for all users aggregates down to one "user" with a "NULL" id [18:46:15] milimetric: ah cause it is null in scoop? [18:46:21] probably, something like that [18:46:32] will look through that flow and submit patches/rerun jobs as needed [18:46:36] good thing we looked! 
[18:46:39] milimetric: ok, thank you [18:46:48] also, scary :( [18:46:57] milimetric: ENTROPHY [18:46:59] for real [18:47:11] milimetric: this is the precise problem those alarms can solve [18:47:51] yes, totally agreed, but it would have been really hard to figure out which columns to add it to [18:48:13] as a matter of fact, if we figured out this column might be null, we would've just discovered the problem before it happened :) [18:49:17] 10Analytics, 10Analytics-Data-Quality: Set entrophy alarm in editors per country for eswiki - https://phabricator.wikimedia.org/T227809 (10Nuria) [18:49:44] milimetric: no, i would have put the alarm in teh country distribution for say eswiki [18:49:54] milimetric: of distinct 5+ editors [18:49:58] that's fine, country distribution is ok [18:50:07] it's the count of editors per country that's just 1 instead of many [18:50:11] milimetric: not probabilistically [18:50:38] I mean, the entropy of the counts, maybe, yeah [18:50:46] but there's no way I would've thought to do that [18:51:30] 10Analytics, 10Analytics-Data-Quality: Set entrophy alarm in editors per country for eswiki - https://phabricator.wikimedia.org/T227809 (10Nuria) [18:52:08] milimetric: ticket created [18:58:09] 10Analytics, 10Product-Analytics, 10Patch-For-Review: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list as needed for iOS - https://phabricator.wikimedia.org/T226849 (10chelsyx) @mforns can you please review the patch? Thanks! [18:59:26] 10Analytics, 10Analytics-EventLogging, 10DBA, 10Operations, 10ops-eqiad: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Cmjohnson) I still need to move the DIMM around ...I need the server taken down. If this needs to be scheduled, please let me kno... 
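The entropy alarm Nuria proposes above would catch exactly this failure mode: when every editor collapses into one NULL "user", the per-country distribution degenerates and its Shannon entropy drops to near zero, far below any healthy month. A minimal sketch of the check; the counts, baseline, and tolerance here are illustrative, not values from the actual data:

```python
# Data-quality check sketched from the discussion: compute the Shannon
# entropy of editors-per-country counts for one wiki/month and alarm when
# it falls well below a historical baseline (a NULL-collapsed snapshot
# yields a degenerate, near-zero-entropy distribution).
import math

def shannon_entropy(counts) -> float:
    """Entropy in bits of the distribution implied by raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_alarm(counts, baseline: float, tolerance: float = 0.5) -> bool:
    """True if entropy dropped more than `tolerance` bits below baseline."""
    return shannon_entropy(counts) < baseline - tolerance

# Healthy month: editors spread over many countries; broken month: a
# single bucket because every row aggregated under the NULL identifier.
healthy = [120, 80, 40, 30, 10, 5]
broken = [1]
print(entropy_alarm(broken, baseline=shannon_entropy(healthy)))  # True
```

Such a check is probabilistic, as noted above: it flags implausible distribution shapes rather than any particular column being NULL.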
[19:05:50] 10Analytics: Bug: geoeditors 2019-06 snapshot broken - https://phabricator.wikimedia.org/T227812 (10Milimetric) [19:06:53] (03PS1) 10Milimetric: Revert sqoop select for cu_changes [analytics/refinery] - 10https://gerrit.wikimedia.org/r/522169 (https://phabricator.wikimedia.org/T227812) [19:07:11] nuria: mind checking https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/522169/ [19:07:23] I can deploy and rerun sqoop [19:07:33] (I'll test) [19:09:38] 10Analytics, 10Analytics-Data-Quality: Set entrophy alarm in editors per country for eswiki - https://phabricator.wikimedia.org/T227809 (10Nuria) pinging @Asaf so he is aware as user of this data [19:09:44] (03CR) 10Nuria: [C: 03+2] Revert sqoop select for cu_changes [analytics/refinery] - 10https://gerrit.wikimedia.org/r/522169 (https://phabricator.wikimedia.org/T227812) (owner: 10Milimetric) [19:10:21] 10Analytics, 10Operations, 10ops-eqiad: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 (10Cmjohnson) @elukey I am not sure which disk this? I think it's a smaller ssd? Can you confirm the disk type and size please ? 
[19:11:07] 10Analytics, 10Analytics-Data-Quality: Set entrophy alarm in editors per country a wiki - https://phabricator.wikimedia.org/T227809 (10Nuria) [19:11:41] 10Analytics, 10Patch-For-Review: Bug: geoeditors 2019-06 snapshot broken - https://phabricator.wikimedia.org/T227812 (10Nuria) pinging @Asaf so he knows we are rerunning data for June 2019 [19:12:38] 10Analytics, 10Patch-For-Review: Bug: geoeditors 2019-06 snapshot broken - https://phabricator.wikimedia.org/T227812 (10Nuria) [19:12:40] 10Analytics, 10Analytics-Data-Quality: Set entrophy alarm in editors per country a wiki - https://phabricator.wikimedia.org/T227809 (10Nuria) [19:19:10] (03CR) 10Milimetric: [V: 03+2] Revert sqoop select for cu_changes [analytics/refinery] - 10https://gerrit.wikimedia.org/r/522169 (https://phabricator.wikimedia.org/T227812) (owner: 10Milimetric) [19:20:43] 10Analytics, 10Analytics-Data-Quality: Set entrophy alarm in editors per country a wiki - https://phabricator.wikimedia.org/T227809 (10Ijon) What is "entrophy"? [19:20:57] 10Analytics, 10Analytics-Data-Quality: Set entrophy alarm in editors per country a wiki - https://phabricator.wikimedia.org/T227809 (10Nuria) [19:22:10] 10Analytics, 10Patch-For-Review: Bug: geoeditors (editors per country data) 2019-06 snapshot broken - https://phabricator.wikimedia.org/T227812 (10Nuria) [19:22:20] 10Analytics, 10Analytics-Data-Quality: Set entrophy alarm in editors per country a wiki - https://phabricator.wikimedia.org/T227809 (10Nuria) sorry @Asaf , wrong ticket, mean to cc you in: https://phabricator.wikimedia.org/T227812 [19:24:01] 10Analytics, 10Patch-For-Review: Bug: geoeditors (editors per country data) 2019-06 snapshot broken - https://phabricator.wikimedia.org/T227812 (10Ijon) Thanks! [19:29:20] 10Analytics, 10VisualEditor, 10Mobile, 10Readers-Web-Backlog (Tracking): EventLogging Schema errors have increased ~6x - https://phabricator.wikimedia.org/T227018 (10Jdlrobson) I guess we can resolve this then? 
In summary this spike was due to EditAttemptStep sending null values ? [19:35:19] (03PS3) 10Ottomata: Merge input JSONSchema with Hive schema before using it to read raw input JSON data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521563 (https://phabricator.wikimedia.org/T227088) [19:42:47] 10Analytics, 10Analytics-Kanban: Bug: geoeditors (editors per country data) 2019-06 snapshot broken - https://phabricator.wikimedia.org/T227812 (10Milimetric) [19:51:12] (03PS2) 10Ottomata: Use JsonParser to parse data that starts with { or [, rather than YAMLParser [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521552 (https://phabricator.wikimedia.org/T227484) [19:56:33] (03PS3) 10Ottomata: Use JsonParser to parse data that starts with { or [, rather than YAMLParser [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521552 (https://phabricator.wikimedia.org/T227484) [20:02:59] (03CR) 10Ottomata: "Tested! This works." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521552 (https://phabricator.wikimedia.org/T227484) (owner: 10Ottomata) [20:03:11] (03CR) 10Ottomata: "tested and works." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521563 (https://phabricator.wikimedia.org/T227088) (owner: 10Ottomata) [20:11:22] 10Analytics, 10VisualEditor, 10Mobile, 10Readers-Web-Backlog (Tracking), 10User-Ryasmeen: EventLogging Schema errors have increased ~6x - https://phabricator.wikimedia.org/T227018 (10Ottomata) 05Open→03Resolved a:03Ottomata agree [20:13:34] ottomata: deploy failed again on an-coord1001 and I need latest refinery there to fix a bug [20:14:16] can I just git pull?! [20:15:18] hm, I guess we don't have any good fix for this problem... 
I'll just run the job manually I guess [20:17:04] looking [20:18:13] oo deploying there, i just deleted the wrong deploy dir [20:22:53] ok fixed milimetric [20:22:55] it failed for some other reason [20:23:05] i saw a __pycache__ file owned as root [20:23:12] that it couldn't remove [20:23:15] not sure why tho [20:23:18] ... [20:23:23] ok, thanks! [20:23:23] i removed it and it deployed properly [20:23:30] makes me feel very unsafe tho [20:24:13] specifically /srv/deployment/analytics/refinery-cache/revs/ffa4931e5b8271e4005241ccbfd822e2f53d3c6e/python/refinery/__pycache__/sqoop.cpython-35.pyc [20:24:24] maybe someone ran sqoop as root it wrote out that file? [20:30:32] PROBLEM - Check the last execution of refine_eventlogging_eventbus_job_queue on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_eventbus_job_queue https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:30:32] PROBLEM - Check the last execution of refine_mediawiki_events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refine_mediawiki_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:31:54] !log resized /srv on an-coord1001 from 60G to 115G - T227132 [20:31:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:31:57] T227132: issues with artifact cache in an-coord1001 - https://phabricator.wikimedia.org/T227132 [20:36:21] 10Analytics, 10Analytics-Kanban, 10Release-Engineering-Team: issues with artifact cache in an-coord1001 - https://phabricator.wikimedia.org/T227132 (10Ottomata) I believe this happens on an-coord1001 and notebook* hosts because their /srv partitions are relatively small. When the disk fills up during scap d... 
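The deploy failure above came from a single file under the scap target dir owned by root, which the deploy user could not remove. A small audit sketch for spotting such strays before they break a deploy; the function name is made up, and the refinery-cache path is the one from the log:

```python
# Find files under a deploy tree NOT owned by the expected deploy user --
# the situation that broke scap above (a root-owned __pycache__/*.pyc).
import os

def files_not_owned_by(root: str, uid: int) -> list:
    """Paths under `root` whose owner uid differs from `uid`."""
    strays = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != uid:
                strays.append(path)
    return strays

# e.g. files_not_owned_by("/srv/deployment/analytics/refinery-cache",
#                         deploy_uid) would have flagged the stray .pyc.
```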
[20:46:44] PROBLEM - Check the last execution of refine_eventlogging_eventbus on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_eventbus https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:47:50] (03CR) 10Nuria: [C: 04-1] Use JsonParser to parse data that starts with { or [, rather than YAMLParser (033 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521552 (https://phabricator.wikimedia.org/T227484) (owner: 10Ottomata) [20:49:48] (03PS4) 10Ottomata: Use JsonParser to parse data that starts with { or [, rather than YAMLParser [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521552 (https://phabricator.wikimedia.org/T227484) [20:53:25] (03CR) 10Nuria: [C: 03+2] Use JsonParser to parse data that starts with { or [, rather than YAMLParser [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521552 (https://phabricator.wikimedia.org/T227484) (owner: 10Ottomata) [20:59:02] (03Merged) 10jenkins-bot: Use JsonParser to parse data that starts with { or [, rather than YAMLParser [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521552 (https://phabricator.wikimedia.org/T227484) (owner: 10Ottomata) [20:59:34] thanks nuria! [20:59:37] did you see my other change too? [21:00:25] ottomata: yes i am STILL reading the 10+ lines comment [21:00:31] ottomata: ayayaya [21:00:46] haha [21:00:50] ottomata: i feel i need to catch up to joseph to truly understand what is going on [21:01:00] ottomata: ayayaya [21:02:26] 10Analytics, 10Analytics-Kanban, 10ExternalGuidance, 10Product-Analytics, 10Patch-For-Review: [Bug] `init` and `mtinfo` event counts drop drastically since June 17 2019 - https://phabricator.wikimedia.org/T227150 (10Nuria) My mistake, i had refined from 17th onward. All data should be there by now. 
[21:02:55] - get jsonschema [21:02:55] - get table schema [21:02:55] - merge jsonschema into table schema [21:02:55] - use merged schema to read json data [21:04:33] ottomata: and the dropping of columns? [21:06:15] ottomata: that is what i do not understand, why the dropping of geo column? [21:06:20] PROBLEM - Check the last execution of refine_sanitize_eventlogging_analytics_immediate on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:07:39] nuria: yeah that took me a minute to figure out [21:07:41] ottomata: root as in the user "root" or "analytics"? [21:07:50] because the hive table has a geocoded_data column [21:07:57] I'm definitely running scripts that use sqoop.py as the analytics user [21:08:05] so by the time the transform function that adds geocoded_data runs [21:08:07] but I'm definitely not running anything as root and neither should anyone else [21:08:08] PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:08:10] the df it is passed already has it [21:08:12] (all nulled) [21:08:37] milimetric: root [21:08:43] yeah... that's crazy [21:08:59] something's new/wrong there on that machine [21:09:00] maybe there was an accidental sudo without -u analytics? dunno. [21:09:17] I don't have rights to do that, so maybe, but I can't think of anyone else that would even want to do that [21:09:53] will all these an-coord problems fix themselves automatically or do we have to restart jobs? 
[21:10:12] PROBLEM - Check the last execution of eventlogging_to_druid_readingdepth_hourly on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_readingdepth_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:10:26] PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:10:26] milimetric: pretty sure they should fix themselves
[21:10:42] at least, the refine ones will
[21:11:05] yeah, but some el-druid ones are failing
[21:12:09] looking
[21:14:40] oof no
[21:14:41] they won't.
[21:14:58] https://phabricator.wikimedia.org/T207207
[21:16:09] ok they will
[21:16:11] there are 2 jobs
[21:16:14] one immediate
[21:16:20] and another daily that fills holes
[21:17:27] hmm, if i run them now i think it will just fix. trying
[21:18:17] !log rerunning /usr/local/bin/eventlogging_to_druid_prefupdate_hourly
[21:18:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:18:58] HMMM ok maybe deploy did not work properly
[21:19:05] ...???
[21:21:57] forcing a deploy to an-coord, i think it didn't complete properly with git fat stuff, but it thought it had.
[21:22:39] ottomata: k
[21:22:59] cc milimetric
[21:24:07] ottomata: I’m running sqoop now on an-coord
[21:24:11] ok
[21:24:13] as analytics
[21:25:49] ok looks better now,
[21:26:42] !log rerunning eventlogging_to_druid_navigationtiming_hourly
[21:26:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:28:54] !log rerunning eventlogging_to_druid_readingdepth_hourly
[21:28:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:30:42] ottomata: ok, once that settles we can talk about cr again and drop/add column
[21:30:55] nuria: it is 'settling' :)
[21:31:01] i can multitask
[21:32:07] ottomata: multitasking while deploying seems not the BEST idea.. just saying
[21:33:16] RECOVERY - Check the last execution of refine_eventlogging_eventbus_job_queue on an-coord1001 is OK: OK: Status of the systemd unit refine_eventlogging_eventbus_job_queue https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:35:10] ottomata: maybe bc and you can explain why the column needs to be dropped if present (it will always be present after table creation)
[21:39:42] let's bc!
[21:39:44] nuria:
[21:39:50] ottomata: k
[21:49:03] RECOVERY - Check the last execution of refine_eventlogging_eventbus on an-coord1001 is OK: OK: Status of the systemd unit refine_eventlogging_eventbus https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:59:20] ok, so it seems that all the eventlogging_to_druid_* timers are still not ok, and I'm guessing we need to do reset-state on them?
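For reference, the manual remediation pattern being applied to these failed timer jobs looks roughly like this (a sketch: the unit name is taken from the alerts above, and running the script as the analytics user is an assumption, not confirmed in the log):

```shell
# Inspect why the unit's last execution failed
systemctl status eventlogging_to_druid_prefupdate_hourly.service

# Re-run the job script by hand, as in the !log entries above
sudo -u analytics /usr/local/bin/eventlogging_to_druid_prefupdate_hourly

# Clear the failed state so the monitoring check can recover before
# the next scheduled hourly execution
sudo systemctl reset-failed eventlogging_to_druid_prefupdate_hourly.service
```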
[22:00:53] well, i just reran the job, not the timer
[22:00:53] itself
[22:01:01] the next time it execs (hourly) it should be fine
[22:01:55] RECOVERY - Check the last execution of eventlogging_to_druid_readingdepth_hourly on an-coord1001 is OK: OK: Status of the systemd unit eventlogging_to_druid_readingdepth_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:02:11] RECOVERY - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-coord1001 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:03:42] milimetric: https://docs.databricks.com/delta/mysql-delta.html
[22:08:19] RECOVERY - Check the last execution of refine_sanitize_eventlogging_analytics_immediate on an-coord1001 is OK: OK: Status of the systemd unit refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:09:38] (03PS4) 10Nuria: Merge input JSONSchema with Hive schema before using it to read raw input JSON data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521563 (https://phabricator.wikimedia.org/T227088) (owner: 10Ottomata)
[22:09:49] (03CR) 10jerkins-bot: [V: 04-1] Merge input JSONSchema with Hive schema before using it to read raw input JSON data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521563 (https://phabricator.wikimedia.org/T227088) (owner: 10Ottomata)
[22:09:51] (03CR) 10Nuria: [C: 03+2] Merge input JSONSchema with Hive schema before using it to read raw input JSON data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521563 (https://phabricator.wikimedia.org/T227088) (owner: 10Ottomata)
[22:10:07] RECOVERY - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-coord1001 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:10:09] (03CR) 10Nuria: [C: 03+1] Merge input JSONSchema with Hive schema before using it to read raw input JSON data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521563 (https://phabricator.wikimedia.org/T227088) (owner: 10Ottomata)
[22:10:14] ottomata: that's cool but requires tables to have primary keys
[22:10:17] I think debezium doesn't: https://debezium.io/docs/connectors/mysql/#deploying-a-connector
[22:10:27] ...mw tables don't have primary keys?!
[22:10:30] at least that's what I remember from reading it last time
[22:10:37] they do, but we pull from the views in labs
[22:10:44] oh
[22:10:45] ottomata: just pushed comment, amend as needed
[22:10:46] so if we pull from production, this is fine
[22:11:00] this would be from binlogs, so views wouldn't work ya
[22:11:19] debezium is binlog too, no?
[22:11:23] oh, that's true, so we have to do it from prod anyway, even with debezium, then yeah it would work
[22:11:26] ya
[22:11:42] it does seem like debezium is a lot more mature and has thought about a ton of things
[22:11:54] just reading through that connector over the past few years it's grown 10x
[22:12:03] ya perhaps, but this is not much more than a spark streaming job
[22:13:22] oo merge conflict...
[22:13:48] ottomata: with?
[22:14:02] hah changelog
[22:14:32] (03PS5) 10Ottomata: Merge input JSONSchema with Hive schema before using it to read raw input JSON data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521563 (https://phabricator.wikimedia.org/T227088)
[22:14:37] ottomata: i think the way to avoid that one is not to add the jar version at top
[22:14:48] oh?
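The primary-key requirement discussed above is what makes binlog-style replay well-defined: every update or delete must name exactly one row, which is why reading from the labs views (which may not expose keys) would not work. A toy illustration in plain Python (not Debezium's or Delta's actual API; event shape is invented):

```python
def replay_binlog(snapshot: dict, events: list) -> dict:
    """Apply binlog-style change events to a table snapshot keyed by
    primary key.  Without a primary key there is no way to tell which
    row an update or delete refers to."""
    table = dict(snapshot)
    for ev in events:
        if ev["op"] in ("insert", "update"):
            table[ev["pk"]] = ev["row"]
        elif ev["op"] == "delete":
            table.pop(ev["pk"], None)
    return table
```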
[22:14:56] (03CR) 10Nuria: [C: 03+2] Merge input JSONSchema with Hive schema before using it to read raw input JSON data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521563 (https://phabricator.wikimedia.org/T227088) (owner: 10Ottomata)
[22:20:37] (03Merged) 10jenkins-bot: Merge input JSONSchema with Hive schema before using it to read raw input JSON data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/521563 (https://phabricator.wikimedia.org/T227088) (owner: 10Ottomata)
[22:24:48] RECOVERY - Check the last execution of refine_mediawiki_events on an-coord1001 is OK: OK: Status of the systemd unit refine_mediawiki_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:48:36] (03PS1) 10Milimetric: Fix example in docs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/522207
[22:49:08] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Fix example in docs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/522207 (owner: 10Milimetric)
[22:59:07] milimetric: were you able to re-run sqoop?
[22:59:31] nuria: I reran sqoop, and now I'm manually running the rest of the pipeline
[22:59:32] load - done
[22:59:40] milimetric: k
[22:59:46] monthly - 1/3 done
[22:59:49] druid - to do
[23:02:27] milimetric: ok, thank you for taking care of those
[23:02:38] np
[23:44:07] nuria: ok, data's ok, loading into druid now
[23:44:14] nuria: btw, thai wikipedia is a good one
[23:44:17] thwiki
[23:57:45] 10Analytics, 10Analytics-Kanban: Bug: geoeditors (editors per country data) 2019-06 snapshot broken - https://phabricator.wikimedia.org/T227812 (10Milimetric) All jobs in the pipeline have been rerun, fresh data is accessible from superset and everywhere else. Apologies for the inconvenience. For reference,...