[00:31:55] 10Quarry, 10DBA, 10Data-Services: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (10zhuyifei1999) The query was executing for too long then. [04:01:33] (03CR) 10Nuria: [C: 04-1] "I think you need to submit next patch?" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/577531 (https://phabricator.wikimedia.org/T212032) (owner: 10Fdans) [04:03:42] (03CR) 10Nuria: [C: 03+2] "Looks good, let's please make sure to update train etherpad once merged as job needs to be restarted" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/577220 (https://phabricator.wikimedia.org/T246309) (owner: 10Fdans) [04:03:44] (03CR) 10Nuria: [V: 03+2 C: 03+2] Set access method to value passed in VirtualPageview event [analytics/refinery] - 10https://gerrit.wikimedia.org/r/577220 (https://phabricator.wikimedia.org/T246309) (owner: 10Fdans) [05:52:49] 10Analytics, 10Product-Analytics (Kanban): SQL definition for wikidata metrics for tunning session - https://phabricator.wikimedia.org/T247099 (10jwang) A few ideas for our discussion. **Metric definition. (Proposed)** A X% increase in wikidata used across wikis. The increase can be measured from different... [06:56:45] 10Quarry, 10DBA, 10Data-Services: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (10Mike_Peel) >>! In T246970#5952464, @Mike_Peel wrote: > I'm now getting the normal 'killed' message for going over 30 minutes, rather than the MySQL error. So perhaps things ar... [07:13:42] 10Analytics, 10DC-Ops, 10Operations, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) https://librenms.wikimedia.org/graphs/to=1584082800/id=12085 https://librenms.wikimedia.org/device/device=149/tab=port/port=12086/ stat1005 and kafka-jumbo1006 are in the sa... 
[07:49:30] 10Analytics, 10DC-Ops, 10Operations, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) I did some tests and the two hosts are definitely related. I logged as root on both via mgmt console and turned off their interfaces, and the stat1005's broadcast traffic wen... [08:32:27] good morning :) [08:32:42] so kafka-jumbo1001 is still down from yesterday evening, together with stat1005 [08:33:13] I think that something weird happened on the cabling side [08:34:27] hosts on the same rack/switch etc.. [08:38:01] 10Analytics, 10DC-Ops, 10Operations, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) The other host in D1, the rack of stat1005 and jumbo1006 is kafka-jumbo1008, one of the new ones: https://netbox.wikimedia.org/dcim/devices/2510/ [08:56:22] 10Analytics, 10Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (10elukey) [09:42:21] (03PS1) 10Elukey: Downgrade toree to 0.2.0 for Buster [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/579522 (https://phabricator.wikimedia.org/T245179) [09:43:38] (03CR) 10Elukey: [V: 03+2 C: 03+2] Downgrade toree to 0.2.0 for Buster [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/579522 (https://phabricator.wikimedia.org/T245179) (owner: 10Elukey) [09:53:48] mforns: o/ [09:54:04] when you have a moment can you do a test for spark/yarn on stat1008? [09:58:34] joal: o/ I downgraded toree to 0.2.0 for Buster as well, so we can check if it was the culprit or not.. [10:51:18] 10Analytics, 10Operations, 10User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (10elukey) 05Open→03Resolved We added jupyterhub to stat1004 and stat1006 and we'll move people with big homes, it should help long term. 
[11:08:41] 10Analytics, 10DC-Ops, 10Operations, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) I checked the last changes happened yesterday on the switch via: ` elukey@asw2-d-eqiad> show system rollback compare 3 0 [edit interfaces interface-range vlan-private1-d-eqi... [11:53:35] going afk for lunch! [12:03:15] 10Analytics, 10DC-Ops, 10Operations, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10akosiaris) I 've had a look as well. I 've checked that the mac address of kafka-jumbo1006 is indeed the one the switch learns and indeed that's true. I 've bounced the port as well... [12:18:14] !log Restart cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2020-3-12 [12:18:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:21:48] 10Analytics, 10Analytics-Kanban, 10serviceops: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (10akosiaris) 05Open→03Stalled >>! In T242861#5941623, @Ottomata wrote: > @akosiaris for my purposes I'm satisfied, but I'm not sure... 
[12:23:43] (03CR) 10Joal: [C: 04-1] Add wikimania.wikimedia.org to the whitelist (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/579300 (https://phabricator.wikimedia.org/T216525) (owner: 10Milimetric) [12:26:54] (03CR) 10Joal: [C: 04-1] Add wikimania.wikimedia.org to the whitelist (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/579300 (https://phabricator.wikimedia.org/T216525) (owner: 10Milimetric) [12:34:15] o/ [12:34:22] hi milimetric [12:34:30] (looking at review, thx Jo) [12:34:34] ;) [12:36:43] (03PS2) 10Milimetric: Add wikimania.wikimedia.org to the whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/579300 (https://phabricator.wikimedia.org/T216525) [12:36:45] (03CR) 10Milimetric: Add wikimania.wikimedia.org to the whitelist (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/579300 (https://phabricator.wikimedia.org/T216525) (owner: 10Milimetric) [12:37:49] (03CR) 10Joal: [V: 03+2 C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/579300 (https://phabricator.wikimedia.org/T216525) (owner: 10Milimetric) [13:00:24] joal: do you know what happened to the geoeditors yearly job and have a sec to explain? Like, oozie itself timed out? [13:00:44] (https://hue.wikimedia.org/oozie/list_oozie_coordinator/0000480-191216160148723-oozie-oozi-C/) [13:00:53] I don't know milimetric actually [13:01:03] ok, cool, I'll investigate [13:01:48] milimetric: my assumption is that the job timed out because of oozie's 3-month retention period (jobs are kept in the DB for 3 months) [13:03:01] right, that was my theory though I had no idea what the timeout was.
I'll look to see if I can set it for the individual job and at this point this job is such a pain in the butt I'm leaning towards just doing it with a calendar invite for myself :) [13:04:53] milimetric: could very well be :)P [13:05:46] milimetric: or we make that job monthly with a conditional step actioning the hive part only for january [13:30:05] 10Analytics, 10Event-Platform, 10serviceops, 10Patch-For-Review, 10Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (10akosiaris) At 12:16 https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/579531/ was merged, rolling b... [13:31:56] ok, joal, here's what I found: oozie.service.coord.default.max.timeout is set to 60 days in our oozie config. The coordinator instance that timed out was created on 2020-01-01, so that means the timeout message should've come on 2020-03-01, which sounds like what happened. Right now the timeout is set to -1 (by this, right? [13:31:56] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/yearly/coordinator.xml#L42). This seems to mean "use the default". I'm not sure if setting the timeout in coordinator.xml will allow it to exceed the default.max.timeout because that's the worst variable name ever - is it DEFAULT or MAX, those are almost opposite things. So, I propose to set the timeout and wait and see. [13:36:19] works for me milimetric :) [13:36:34] thanks for looking into this [14:00:14] 10Analytics, 10Analytics-Kanban, 10ArticlePlaceholder, 10Wikidata, and 4 others: ArticlePlaceholder dashboard stopped tracking page views - https://phabricator.wikimedia.org/T236895 (10Lydia_Pintscher) The dashboard still doesn't update unfortunately. Is there anything else that needs to be done? 
[14:02:51] 10Analytics, 10Analytics-Kanban, 10Release Pipeline, 10Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (10Ottomata) I think there are still some issues, but most things seem to be working fine. There is a periodic [[ https://grafa... [14:12:35] 10Analytics, 10Analytics-Kanban, 10ArticlePlaceholder, 10Wikidata, and 4 others: ArticlePlaceholder dashboard stopped tracking page views - https://phabricator.wikimedia.org/T236895 (10JAllemandou) Patch needs to be deployed before the dashboard shows data. [14:24:07] 10Analytics, 10Analytics-Kanban, 10Release Pipeline, 10Patch-For-Review, and 2 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (10akosiaris) While things do indeed look way better, the memory leak is most certainly still there. Looking at https://grafana.... [14:41:49] joal: just fyi, about incremental checkpoints https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html [14:55:19] (03PS1) 10Milimetric: Increase yearly coordinator timeout to 370 days [analytics/refinery] - 10https://gerrit.wikimedia.org/r/579577 (https://phabricator.wikimedia.org/T246753) [14:56:58] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: geoeditors-yearly job times out - https://phabricator.wikimedia.org/T246753 (10Milimetric) As noted in the code, if the change to the timeout doesn't work, this job will have a TIMEOUT status again in 60 days. At that point we'd need to change oozie's def... 
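The date arithmetic behind the geoeditors-yearly timeout discussed above can be checked with a short sketch. It only reproduces the numbers quoted in the channel (a coordinator instance created on 2020-01-01, a 60-day default maximum, and the proposed 370 days). The function names are hypothetical, and the conversion to minutes assumes that Oozie expresses coordinator timeouts in minutes, which the log itself does not state.

```python
from datetime import date, timedelta

MINUTES_PER_DAY = 24 * 60


def timeout_date(created: date, timeout_days: int) -> date:
    """Date on which an instance created on `created` would time out."""
    return created + timedelta(days=timeout_days)


def days_to_minutes(days: int) -> int:
    """Convert a timeout in days to minutes (assumed unit of the Oozie setting)."""
    return days * MINUTES_PER_DAY


if __name__ == "__main__":
    created = date(2020, 1, 1)
    # 60-day maximum quoted from the oozie config in the discussion
    print(timeout_date(created, 60))
    # proposed 370-day timeout from the gerrit change, in days and in minutes
    print(timeout_date(created, 370), days_to_minutes(370))
```

Because 2020 is a leap year, 60 days from 2020-01-01 lands exactly on 2020-03-01, which matches the date the TIMEOUT status was expected in the conversation.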
[15:07:37] (03CR) 10Milimetric: [C: 03+2] Avoid generating a full build with each language [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/579241 (https://phabricator.wikimedia.org/T246778) (owner: 10Fdans) [15:08:53] (03Merged) 10jenkins-bot: Avoid generating a full build with each language [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/579241 (https://phabricator.wikimedia.org/T246778) (owner: 10Fdans) [15:09:06] hey teamm :] [15:09:13] elukey, I got a ping from you [15:09:32] oh, stat1008 will do [15:09:45] mforns: hola! whenever you have time, no rush [15:10:43] 10Analytics, 10Product-Analytics, 10Inuka-Team (Kanban): Set up pageview counting for KaiOS app - https://phabricator.wikimedia.org/T244547 (10Nuria) I see, we will go ahead in our end and: 1) we will count as pageviews the https://en.wikipedia.org/api/rest_v1/page/mobile-sections/ requests 2) reques... [15:11:03] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Inuka-Team (Kanban): Set up pageview counting for KaiOS app - https://phabricator.wikimedia.org/T244547 (10Nuria) a:03Milimetric [15:11:17] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Inuka-Team (Kanban): Set up pageview counting for KaiOS app - https://phabricator.wikimedia.org/T244547 (10Nuria) [15:13:49] (03CR) 10Mforns: [C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/579577 (https://phabricator.wikimedia.org/T246753) (owner: 10Milimetric) [15:19:41] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Inuka-Team (Kanban): Set up pageview counting for KaiOS app - https://phabricator.wikimedia.org/T244547 (10Milimetric) p:05Triage→03High [15:22:11] hey, I cant ssh into stat1005 and saw that it has been down for some time. do you have any updates or estimates of how long it could take to be up again? 
[15:22:29] mgerlach: there is a network problem, we don't really know at this point :( [15:22:36] we are working with SRE [15:23:02] elukey: thanks : ( [15:39:52] 10Analytics, 10ContentTranslation, 10Language-Team (Language-2020-January-March): Test Performance of Marian NMT translation in stat cluster - https://phabricator.wikimedia.org/T247245 (10MoritzMuehlenhoff) @santhosh When you've setup your test environment and want to test OpenBLAS optimised for the CPU arc... [15:57:10] 10Analytics, 10Event-Platform, 10serviceops, 10Patch-For-Review, 10Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (10akosiaris) p:05High→03Medium We are going to leave it like this for the weekend. codfw mw hosts talking to loc... [16:39:36] elukey, scala spark YARN engine in stat1008 fails for me... it doesn't even start [16:39:42] looking at the logs [16:42:05] interesting [16:42:06] again org.zeromq.ZMQException: Errno 48 : Address already in use [16:42:56] but I see in your venv toree 0.2.0, that should be the right version [16:43:04] mforns: can you tell me exactly how to repro? [16:43:28] elukey, I open jupyterlab through 1008 [16:43:33] then log in [16:43:38] create a terminal tab and kinit [16:43:48] create a scala spark YARN tab [16:43:57] spark.sql("select 1") [16:44:30] mforns: can you try a query that works? [16:44:36] xD [16:44:39] this one should work! 
[16:44:48] yes sorry I mean another query [16:44:52] ok [16:45:52] elukey, just tried: spark.sql("select * from event.navigationtiming where year=2020 and month=3 and day=1 and hour=0 limit 10") [16:49:59] let me know if/when it finishes, I don't see errors for now [16:50:20] On Hue, I'm running into the error message "Error while compiling statement: FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient" whenever I attempt to query and it also doesn't show the tables on the left panel. I'm assuming this has to do with Kerberos? [16:51:16] lexnasser: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue - there is a workaround for it [16:51:39] elukey: Perfect, thanks! [16:51:51] elukey, it gave the same error [16:52:03] the one at 16:51 [16:52:05] mforns: what error does it return in the notebook? [16:52:11] the one in the logs? org.zeromq.ZMQException: Errno 48 : Address already in use [16:55:07] mforns: ok I may have an idea [16:55:16] the issue seems to be https://issues.apache.org/jira/browse/TOREE-485 [16:55:39] elukey, the notebook doesn't say anything [16:55:46] just ignores my command [16:56:15] the error you pasted looks the same [16:56:26] the issue refers to jupyterlab 0.34, and we had jupyterlab==0.32.1 before the upgrade in pip's frozen requirements [16:56:53] so what I think is happening is that the kernels created via toree 0.2.0 are not working with the new version of jupyterlab [16:57:00] aha [16:57:10] but you don't have that problem right?
[16:57:12] so I'll need to re-build them with Toree 0.3.0 [16:57:17] yeah I don't for some reason [16:57:25] weird [17:04:53] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on icinga1001 is CRITICAL: 1.24e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [17:08:30] I am roll restarting mirror maker on jumbo to see if it fixes it [17:08:48] k [17:12:15] need to step away for 20/30 mins [18:09:24] 10Analytics, 10wmfdata-python, 10Product-Analytics (Kanban): Update wmfdata to support multiple SQL engines for Hive databases - https://phabricator.wikimedia.org/T246060 (10nshahquinn-wmf) 05Open→03Resolved We've released [version 1.0](https://github.com/neilpquinn/wmfdata/blob/master/CHANGELOG.md#100-1... [18:09:26] 10Analytics, 10wmfdata-python, 10Epic, 10Product-Analytics (Kanban): Analysts cannot reliably use wmfdata to run SQL queries against Hive databases - https://phabricator.wikimedia.org/T245891 (10nshahquinn-wmf) [18:12:11] 10Analytics, 10Analytics-Kanban, 10Epic, 10Product-Analytics (Kanban): Spark sessions can provision kerberos tickets in a more predictable manner - https://phabricator.wikimedia.org/T246132 (10nshahquinn-wmf) > Right now kerberos tickets are expiring after 24 hours, this leaves spark sessions hanging after... [18:15:35] milimetric: Is there any reason why the current geoeditors bucketed Oozie job doesn't have an `sla_alert_contact` in coordinator.properties? [18:15:36] 10Analytics, 10wmfdata-python, 10Epic, 10Product-Analytics (Kanban): Analysts cannot reliably use wmfdata to run SQL queries against Hive databases - https://phabricator.wikimedia.org/T245891 (10nshahquinn-wmf) 05Open→03Resolved Now that wmfdata 1.0 has been released with support for sensible Spark set... 
[18:17:27] lexnasser: I’m not sure if that job has an sla, there’s nothing that has a hard dependency on it. But now with the API you’re making there would be so feel free to add one [18:19:07] milimetric: What email does the `send_error_email` sub-workflow send to? [18:21:00] milimetric: Disregard the above question, looks like it sends to analytics-alerts@wikimedia.org. for testing should I remove the sub workflow so it doesn't notify on error? [18:22:09] lexnasser: the easiest way is to clone your change from gerrit and modify the default setting, wanna start our 1/1 a little early and I can show you? [18:23:03] milimetric: I think I understand, but I'll hop on now [18:31:49] 10Analytics, 10Analytics-SWAP, 10Product-Analytics: Provide Python 3.6+ on SWAP - https://phabricator.wikimedia.org/T212591 (10nshahquinn-wmf) Bokeh 2.0, which was recently released, [requires Python 3.6 or higher](https://docs.bokeh.org/en/latest/docs/releases.html#migration-guide) as well. [18:46:11] 10Analytics, 10Analytics-SWAP, 10Product-Analytics: Provide Python 3.6+ on SWAP - https://phabricator.wikimedia.org/T212591 (10elukey) >>! In T212591#5474939, @Ottomata wrote: >> You can use python3.7 in your venv when you create the notebook > All of the default python notebook kernels we make available in... [19:11:26] 10Analytics, 10DC-Ops, 10Operations, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) Chris moved the servers to different ports, and for kafka-jumbo1006 it helped, since it is now serving traffic. stat1005 is still suffering of the same issue though. 
[19:12:20] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 34 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [19:12:53] 10Analytics, 10Inuka-Team, 10Product-Analytics: Set up preview counting for KaiOS app - https://phabricator.wikimedia.org/T244548 (10nshahquinn-wmf) Because the KaiOS app is an experimental project and the team is pushing to get the initial release ready as soon as possible, they have decided to defer the Vi... [19:12:53] elukey: Let me know if I can help with kafka [19:13:03] 10Analytics, 10Inuka-Team, 10Product-Analytics: Set up preview counting for KaiOS app - https://phabricator.wikimedia.org/T244548 (10nshahquinn-wmf) [19:13:48] joal: thanks! things seems under control on the kafka side, but stat1005 is still down [19:13:56] :( [19:13:56] will restart working on it on monday [19:14:14] elukey: didn't follow stat1005 - notebooks stuff right? [19:14:47] joal: nono it went out of network with kafka-jumbo1006 [19:14:56] Oh didn't get that [19:15:00] ok [19:15:01] we tried to check a bazillion settings but so far nothing [19:15:05] very weird [19:15:07] :( [19:15:16] the urgent part was jumbo :) [19:15:22] kafka was indeed more important [19:15:36] thanks a lot for caring elukey <3 [19:15:44] all right have a good weekend folks! :) [19:15:46] * elukey off! [19:21:41] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Inuka-Team (Kanban): Set up pageview counting for KaiOS app - https://phabricator.wikimedia.org/T244547 (10nshahquinn-wmf) >>! In T244547#5967474, @Nuria wrote: > I see, we will go ahead in our end and: > > 1) we will count as pageviews the https://e... 
[19:31:31] 10Analytics, 10Analytics-SWAP, 10Product-Analytics: Provide Python 3.6+ on SWAP - https://phabricator.wikimedia.org/T212591 (10Ottomata) Likely the SWAP wheels that you've built for buster use Python 3.7, right? So, when folks log into it on a Buster node, they will get a copy of that Python 3.7 venv. I lo... [21:43:22] 10Analytics, 10DC-Ops, 10Operations, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10Dzahn) Fixed by @Papaul for kafka-jumbo1006. We saw recoveries for kafka lag on other machines all at once. [21:59:46] 10Analytics, 10Operations, 10ops-eqiad: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Papaul) We have wrong mgmt password on all 3 nodes [22:07:35] 10Analytics, 10DC-Ops, 10Operations, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10Papaul) I have @Jclark-ctr repalce the cable to stat1005 same issue. I have him also disconnect the cable while i was looking at the switch the interface went from up up to up down a... [22:41:39] 10Analytics, 10Analytics-Wikistats, 10Product-Analytics: Contribution inequality graphs for Wikistats - https://phabricator.wikimedia.org/T195033 (10Quasipodo) You sure is that expensive? In WikiChron we don't have big resources and we are able to compute the Gini coefficient on very large wikis. It just a m... [22:52:38] 10Analytics, 10Analytics-Kanban: Support CSV uploads in Superset - https://phabricator.wikimedia.org/T245679 (10EYener) Hi @Nuria and all, we're ready to try a 'mock' data set as well. Can someone point me toward instructions on accessing and utilizing the staging environment so that I can get started with the... 
[22:53:47] 10Analytics, 10Analytics-Kanban: Support CSV uploads in Superset - https://phabricator.wikimedia.org/T245679 (10Nuria) @EYener CSV uploads are enabled on http://superset.wikimedia.org so no special access needed [22:57:33] 10Analytics, 10Analytics-Kanban: Support CSV uploads in Superset - https://phabricator.wikimedia.org/T245679 (10EYener) Thank you, @Nuria ! It works seamlessly.