[06:27:57] Analytics, Operations, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (elukey) >>! In T227288#5307228, @MoritzMuehlenhoff wrote: > Should these really be both in eqiad? The initial use case is for analytics, but we migh...
[06:58:20] Analytics, ChangeProp, EventBus, Core Platform Team Backlog (Later), Services (next): Enable controlled debug logging for change-prop - https://phabricator.wikimedia.org/T189621 (elukey)
[06:59:17] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274 (elukey)
[06:59:28] Analytics, Product-Analytics, Epic: Provide feature parity between the wiki replicas and the Analytics Data Lake - https://phabricator.wikimedia.org/T212172 (elukey)
[07:01:10] Analytics, EventBus, Operations, User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048 (elukey) Open→Declined Eventbus is on its road to decommission in favor of event-gate, I'd close this task since probably not relevant any...
[07:19:56] Analytics, Operations, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey) Tried to check in /var/log/apt/history the packages installed to make the Tensorflow and Thumbor (uses OpenCL) use case working: ` cxlactivitylogger hcc hsa-rocr-dev...
[07:33:38] Analytics, Operations, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey) Also there seems to be some movement in Debian for rocm: https://lists.debian.org/debian-devel/2019/06/msg00302.html
[08:16:00] !log forced manual run of refinery-druid-drop-public-snapshots.service on an-coord1001
[08:16:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:31:06] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Dropping data from druid takes down aqs hosts - https://phabricator.wikimedia.org/T226035 (elukey) p:High→Normal Triggered manually the timer to drop the segment, that deleted 2019-02 from the public cluster. I did notice a very...
[08:58:16] dsaez: o/ are you online by any chance?
[08:58:46] I'd need to ask you some details about the jobs running on stat1007 (nothing on fire, I am investigating why a lot of ports are opened)
[09:07:00] Analytics, Analytics-Cluster, Patch-For-Review, User-Elukey: Enable base::firewall on stat boxes after restricting Spark REPL ports. - https://phabricator.wikimedia.org/T170826 (elukey) I can see from https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#firewall-setup that high po...
[09:47:41] Analytics, Product-Analytics, Epic: Add wikidata ids to data lake tables - https://phabricator.wikimedia.org/T221890 (elukey)
[09:48:16] Analytics, EventBus, Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Backlog (Later), Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (elukey)
[10:03:31] * elukey errand + lunch!
[10:19:46] elukey, here.
[10:19:57] jobs on stat1007?! let me see
[10:21:21] elukey, I killed one, the others are ghost processes, you can kill them if you want
[12:02:27] Analytics, Operations, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (MoritzMuehlenhoff) >>! In T227288#5307686, @elukey wrote: > This is a very good point. Would we have only one KDC per datacenter? I think having o...
[12:02:40] Analytics, Operations, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (MoritzMuehlenhoff) p:Triage→Normal
[12:05:17] Analytics, Operations, Traffic: Increased number of webrequest sequence-numbers alarms (mostly) on upload webrequest-source - https://phabricator.wikimedia.org/T225786 (ema)
[12:50:18] dsaez: hey there! Sorry I missed you before going afk.. no issues with your processes, just wanted to know if you need to access them from another stat/notebook machine
[12:50:35] since they open network ports (see my email to analytics@ for more info..)
[12:52:40] Analytics, Analytics-Kanban: Decide: start_timestamp for mediawiki history - https://phabricator.wikimedia.org/T220507 (Milimetric) well, in most cases the "create" event is actually a create event, just not explicitly logged that way in the logging table because that type of event wasn't added until 201...
[12:59:14] Analytics, Operations, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (elukey) Makes sense, the extra latency to codfw shouldn't be a big deal. I know that we need to have only one kadmin server, but I was thinking abou...
[13:18:01] elukey: so the new mediawiki history snapshot is done, checked, and I'm just running manual checks but they look good so far
[13:18:16] we could deploy now, or wait to do more queries out of an abundance of caution
[13:18:41] but obviously it's Friday, so what are your thoughts?
[13:22:11] milimetric: morning!
[13:22:16] morning!
[13:22:32] Analytics, Analytics-Cluster, Patch-For-Review, User-Elukey: Enable base::firewall on stat boxes after restricting Spark REPL ports. - https://phabricator.wikimedia.org/T170826 (mpopov) How would this impact the ability to install Python and R packages? Will `export https_proxy=http://webproxy.eq...
[13:22:41] the deployment is relatively easy, and the rollback takes very little, so I'd say that we could go ahead
[13:25:29] Analytics, Analytics-Cluster, Patch-For-Review, User-Elukey: Enable base::firewall on stat boxes after restricting Spark REPL ports. - https://phabricator.wikimedia.org/T170826 (elukey) >>! In T170826#5308482, @mpopov wrote: > How would this impact the ability to install Python and R packages? Wi...
[13:26:14] milimetric: do you want me to create a patch?
[13:26:36] elukey: no prob, I got it, was just checking something
[13:26:40] I'll make a patch
[13:31:27] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520884/
[13:39:38] milimetric: change deployed on aqs, but only applied to aqs1004
[13:39:41] that is depooled now
[13:39:49] so you can test it to check that everything is good
[13:39:57] sure, will curl a bit
[13:40:02] and if so, I'll repool it and update the other nodes
[13:43:14] looks sensible to me elukey
[13:43:33] responding to different types of endpoints, data updated in a way that makes sense
[13:43:47] all right, let's do it, ok?
[13:44:18] elukey: yeah, all good for me
[13:47:52] milimetric: done!
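The rollout order described in the exchange above (apply the change to one depooled host, check it by hand, then repool and update the remaining nodes) can be sketched roughly as follows. This is only an illustration of the ordering, not the tooling actually used: `canary_rollout`, `apply_change` and `check_host` are hypothetical names, and every host name except aqs1004 is made up.

```python
from typing import Callable, Iterable, List


def canary_rollout(
    hosts: Iterable[str],
    canary: str,
    apply_change: Callable[[str], None],
    check_host: Callable[[str], bool],
) -> List[str]:
    """Apply a change to one canary host first, then to the rest if it checks out.

    Returns the hosts the change was applied to, in the order it was applied.
    All callables are hypothetical stand-ins for the real deployment steps.
    """
    updated: List[str] = []

    # Step 1: the change goes only to the (depooled) canary host.
    apply_change(canary)
    updated.append(canary)

    # Step 2: verify the canary, e.g. with a few manual requests.
    if not check_host(canary):
        # Stop here: the other hosts are untouched, so rollback stays cheap.
        return updated

    # Step 3: the canary looked good, so update every remaining host.
    for host in hosts:
        if host == canary:
            continue
        apply_change(host)
        updated.append(host)
    return updated


if __name__ == "__main__":
    # "aqs-example-a" and "aqs-example-b" are invented host names.
    fleet = ["aqs-example-a", "aqs1004", "aqs-example-b"]
    done = canary_rollout(
        fleet,
        canary="aqs1004",
        apply_change=lambda host: print(f"applying change on {host}"),
        check_host=lambda host: True,  # pretend the manual check passed
    )
    print(done)
```

The point of the ordering is that a failed check leaves only one host, which serves no traffic, in the new state.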
[13:48:05] ok, great, will test wikistats
[13:48:59] hm, I guess there's no way to force it to bust the varnish cache so I have to wait
[13:49:13] I can see druid's brokers already caching the new snapshot
[13:49:22] https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&orgId=1
[13:51:25] hm, interesting, i never really understood this cache path
[13:51:39] wikistats still shows old results, but druid's clearly getting cache misses
[13:52:49] well there is the AQS API
[13:53:16] I guess that traffic hitting it directly should get the new stuff
[13:55:00] i thought everything had to go through varnish to get to Druid, and nothing was direct
[13:55:05] but yeah, direct traffic would make sense
[14:07:17] Analytics, Analytics-Kanban, User-Elukey: Enable encryption and authentication for TLS-based Hadoop services - https://phabricator.wikimedia.org/T217412 (elukey) https://hadoop.apache.org/docs/r2.7.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html > Client Certificates > U...
[14:29:02] Analytics, Analytics-Kanban, User-Elukey: Enable encryption and authentication for TLS-based Hadoop services - https://phabricator.wikimedia.org/T217412 (elukey) Just tried to test it doing some curl to GET /mapOutput?job=job_1561367702623_49144&reduce=0&map=attempt_1561367702623_49144_m_000023_0 HTT...
[14:29:11] Analytics, Analytics-Kanban, User-Elukey: Enable encryption and authentication for TLS-based Hadoop services - https://phabricator.wikimedia.org/T217412 (elukey)
[14:46:01] hm, still cached https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-projects/all-editor-types/all-page-types/monthly/2017060100/2019070500
[14:51:00] (PS1) Ladsgroup: Fix loading of the bootstrap file [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/520893 (https://phabricator.wikimedia.org/T218903)
[14:55:10] (CR) Lucas Werkmeister (WMDE): [C: +2] Fix loading of the bootstrap file [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/520893 (https://phabricator.wikimedia.org/T218903) (owner: Ladsgroup)
[14:55:44] (Merged) jenkins-bot: Fix loading of the bootstrap file [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/520893 (https://phabricator.wikimedia.org/T218903) (owner: Ladsgroup)
[14:56:07] (PS1) Ladsgroup: Fix loading of the bootstrap file [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/520895 (https://phabricator.wikimedia.org/T218903)
[14:56:59] (CR) Ladsgroup: [C: +2] Fix loading of the bootstrap file [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/520895 (https://phabricator.wikimedia.org/T218903) (owner: Ladsgroup)
[14:57:41] (Merged) jenkins-bot: Fix loading of the bootstrap file [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/520895 (https://phabricator.wikimedia.org/T218903) (owner: Ladsgroup)
[14:59:55] it seems that Chrome is caching it, and there's no way to convince it not to
[15:00:02] curl gives me the latest, so it must be
[15:04:09] well, Druid seems to have already cached everything it needs, and new data looks good everywhere I looked so I'm gonna declare a success. I added a little note about caching to the documentation for updating the snapshot
[15:08:20] super
[15:08:42] let's also update the mailing list so people are aware
[15:17:49] (PS1) Ladsgroup: Use config for wdqs host name [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/520903 (https://phabricator.wikimedia.org/T218710)
[15:18:33] (CR) jerkins-bot: [V: -1] Use config for wdqs host name [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/520903 (https://phabricator.wikimedia.org/T218710) (owner: Ladsgroup)
[15:40:42] Analytics, MediaWiki-extensions-ORES, Scoring-platform-team, Core Platform Team (Modern Event Platform (TEC2)), and 3 others: ORES hook integration with EventBus - https://phabricator.wikimedia.org/T201869 (WDoranWMF)
[15:40:46] Analytics, EventBus, Core Platform Team (Modern Event Platform (TEC2)), Services (next): EventBusRCFeedFormatter should clean up events from nulls - https://phabricator.wikimedia.org/T216567 (WDoranWMF)
[15:45:32] Analytics, EventBus, Growth-Team, MediaWiki-Watchlist, and 3 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (WDoranWMF)
[15:46:36] Analytics, EventBus, Core Platform Team (Security, stability, performance and scalability (TEC1)), Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (WDoranWMF)
[15:52:13] Analytics, Operations, Patch-For-Review, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey) https://wikitech.wikimedia.org/wiki/Reprepro#Adding_a_new_external_repository I think that the key used (http://repo.radeon.com/rocm/apt/debian/...
[16:09:45] Analytics, Analytics-Kanban, good first bug: Reportupdater: do not write execution control files in source directories - https://phabricator.wikimedia.org/T173604 (Nuria)
[16:09:54] Analytics, good first bug: Reportupdater: do not write execution control files in source directories - https://phabricator.wikimedia.org/T173604 (Nuria)
[16:10:37] * joal claps for a successful new snapshot :)
[16:10:47] Thanks a million milimetric and elukey :)
[16:24:15] elukey: thanks for email on stat machine backups! question raised by dsaez: what is the RAID level for these machines (trying to get a sense of how likely a failure is)?
[16:27:20] isaacj: hi! IIRC we have raid 10, spanning multiple disks, but I'd need to check
[16:27:28] it is not a simple raid 1
[16:29:31] isaacj: just checked, raid10 over 4 disks
[16:29:39] (the srv partition, where the homes are)
[16:30:09] so one disk failure is tolerable, more might be problematic if we are unlucky
[16:30:12] I understand my mistake with the cache Nuria, because curl doesn't have a cache of course, I'll update the note
[16:30:22] thanks elukey ! awesome, so that really will be fine unless there is a massive earthquake essentially
[16:30:38] * milimetric should not try to think when sick
[16:33:01] elukey, are the datacenters located in places with a lot of earthquakes? I hope they are not in Indonesia, Chile, Mexico or California :D
[16:37:49] dsaez: the stat/notebooks are in Virginia :D
[16:38:33] * elukey off!
[16:41:06] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Dropping data from druid takes down aqs hosts - https://phabricator.wikimedia.org/T226035 (Nuria) At this point I think we can close this ticket and re-open if it is to happen again?
[16:56:24] (PS4) Nuria: Most special pages should not be pageviews [analytics/refinery/source] - https://gerrit.wikimedia.org/r/520671 (https://phabricator.wikimedia.org/T226730)
[16:57:20] (CR) Nuria: "Please see next patch." (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/520671 (https://phabricator.wikimedia.org/T226730) (owner: Nuria)
[17:02:10] (PS2) Ladsgroup: Use config for wdqs host name [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/520903 (https://phabricator.wikimedia.org/T218710)
[17:22:39] Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): EventLogging needs to enque events to avoid draining users' battery on mobile - https://phabricator.wikimedia.org/T225578 (Nuria) a:Milimetric
[17:22:54] Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): EventLogging needs to enque events to avoid draining users' battery on mobile - https://phabricator.wikimedia.org/T225578 (Nuria) Ping @gilles: work on this to start second week of July
[17:23:24] (PS1) Ladsgroup: Remove action counter in apiLogScanner.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/520918 (https://phabricator.wikimedia.org/T226292)
[17:41:54] Analytics, EventBus, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Kanban (Team 2), and 2 others: > 2% of API wall time spent generating UUIDs - https://phabricator.wikimedia.org/T222966 (WDoranWMF)
[18:17:29] Analytics, EventBus, WMF-JobQueue, Core Platform Team (Security, stability, performance and scalability (TEC1)), Wikimedia-production-error: EventBus error "Unable to deliver all events: (curl error: 28) Timeout was reached" - https://phabricator.wikimedia.org/T204183 (WDoranWMF)
[18:19:13] Analytics, EventBus, Product-Analytics, Core Platform Team (Modern Event Platform (TEC2)): Eventbus revisions are duplicated in event.mediawiki_revision_tags_change - https://phabricator.wikimedia.org/T218246 (WDoranWMF)
[18:28:15] Analytics, EventBus, MediaWiki-JobQueue, Core Platform Team Backlog (Later): Create scripts to estimate Kafka queue size per wiki - https://phabricator.wikimedia.org/T182259 (WDoranWMF)
[18:39:22] Analytics, Analytics-Kanban: Alarming scripts for entrophy alarms. Anomaly detection and reporting. - https://phabricator.wikimedia.org/T227357 (Nuria)
[19:05:10] Analytics, MediaWiki-API, RESTBase-API, Core Platform Team Backlog (Later): Top API user agents stats - https://phabricator.wikimedia.org/T142139 (WDoranWMF)
[19:07:29] Analytics, RESTBase, Core Platform Team Backlog (Later): REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245 (WDoranWMF)
[20:12:59] Analytics, MediaWiki-API, RESTBase-API, Core Platform Team Backlog (Later): Top API user agents stats - https://phabricator.wikimedia.org/T142139 (Nuria) I saw this ticket go by, much has changed since it was filed. ActionAPI table has not been updated for a while, a much more reliable flow of da...
[20:14:11] fdans: let me know what you think but maybe we could refactor a bit -for simplicity - the media code before we add the new changes ?
[21:05:42] Analytics, Analytics-Kanban, Patch-For-Review: Generate edit totals by country by month/year - https://phabricator.wikimedia.org/T215655 (Nuria) Ping @Milimetric can we merge the monthly job?
[21:25:15] Analytics-Kanban, Product-Analytics: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (Nuria) Some recent work on this. @JFishback_WMF is working on risk assessment framework with legal that we can apply to data releases such as this one....
[22:05:16] Analytics-Kanban, Product-Analytics: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (Ijon) Yes, anonymous editors matter too. Though I am mostly interested in the old "active" (>5/month) and "very active" (>100/month) definitions, and h...
[23:20:53] GoranSM: Please be so kind to stop your cluster job until we can take a look at how to improve it
[23:37:37] nuria: Does this refer to the job that is currently running, or did you send this message before the email informing me that the previous run was killed?
[23:53:12] nuria: First of all, I do apologize for hitting hard against our cluster. I simply need a data set urgently. Second, the job that is currently running is some 75% near completion. If you really need to kill it, do it, and we will find a more efficient way. If not, please, please just let it finish and I promise you nothing similar will happen in the foreseeable future. Thanks.