[00:14:24] Analytics-Radar, Product-Analytics: Provide a list of 100 most popular articles of Russian and English Wikipedias in terms of page views from Ukraine - https://phabricator.wikimedia.org/T273924 (kzimmerman) Open→Declined @ua_user We're not able to take on this request; the data you're requesting...
[00:43:47] Analytics, Better Use Of Data, Product-Data-Infrastructure: Define acceptable usage of the `meta` object in event schemas - https://phabricator.wikimedia.org/T273293 (kzimmerman) Moving to the backlog until we're ready to pick this up
[01:07:34] Analytics-Kanban, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure: Roll-up raw sessionTick data into distribution - https://phabricator.wikimedia.org/T271455 (kzimmerman)
[01:11:56] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (Jclark-ctr)
[01:16:31] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (Jclark-ctr) a:Cmjohnson→RobH racked & cabled, bios configured, network configured. handing over to Rob for imaging
[01:37:40] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (wiki_willy) Nice work, thanks @Jclark-ctr >>! In T260445#6839851, @Jclark-ctr wrote: > racked & cabled, bios configured, network configured. handing ove...
[02:10:13] (CR) Milimetric: [V: +2 C: +2] Remove disabled jobs from reportupdater [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664937 (owner: Milimetric)
[02:23:42] (PS1) Milimetric: Fix more syntax errors [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664986
[02:24:27] (CR) Milimetric: [V: +2 C: +2] Fix more syntax errors [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664986 (owner: Milimetric)
[02:25:37] (PS1) Milimetric: Remove invalid job, echo tables no longer where expected [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664987
[02:28:11] (CR) Milimetric: "Adding you two because I hope someone remembers why we migrated this job and/or who might be interested in taking a look at the removal. " [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664987 (owner: Milimetric)
[02:32:56] (CR) Milimetric: [V: +2 C: +2] "Related puppet change:" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664987 (owner: Milimetric)
[02:34:53] Analytics, Product-Infrastructure-Team-Backlog, Wikimedia Taiwan, Chinese-Sites, Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (Shizhao) >>! In T274605#6833068, @Antigng wrote: > Thank you, and their mailbox is the follo...
[02:55:38] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) updated firmware for idrac for the remainder, will update bios and image tomorrow
[03:05:20] PROBLEM - Check the last execution of reportupdater-ee on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-ee https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:58:37] ^ that timer was unused and has been removed
[04:00:42] Puppet has run on an-launcher1002, but it still says failed; the timer may need to be manually removed
[05:20:33] (PS2) Lex Nasser: Fix unit tests that ensure certain requests fail and clean up all unit tests [analytics/aqs] - https://gerrit.wikimedia.org/r/662821 (https://phabricator.wikimedia.org/T273404)
[05:58:59] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (razzi) I'm thinking of writing up the steps for rebalancing partitions in a wiki article such as https://wikitech.wikimedia.org/wiki/Kafka/Administration, and I'...
[06:53:34] Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (elukey) I tried to reproduce with ` sudo -u analytics-privatedata kerberos-run-command analytics-privatedata beeline` and the commands that you posted w...
[07:11:23] goood morning
[07:24:43] Good morning
[07:25:42] Analytics: Default hive table creation to parquet - needs hive 2.3.0 - https://phabricator.wikimedia.org/T168554 (elukey) @JAllemandou we can finally do this now :)
[07:27:29] Analytics: Default hive table creation to parquet - needs hive 2.3.0 - https://phabricator.wikimedia.org/T168554 (JAllemandou) Yes! The only concern I have is with the null-value in struct bug we've hit. It seems related to parquet.
I think we should do it and possibly revert if too many problems show up :)
[07:28:16] Analytics: Default hive table creation to parquet - needs hive 2.3.0 - https://phabricator.wikimedia.org/T168554 (elukey) ` 0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat; going to print operations logs printed operations logs Getting log thread is interrupted, since query is done...
[07:29:22] Analytics-Clusters: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) Stalled→Open
[07:29:27] Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (elukey)
[07:45:48] Analytics: Default hive table creation to parquet - needs hive 2.3.0 - https://phabricator.wikimedia.org/T168554 (JAllemandou) I confirm the change of property does something (newly created table without format is stored as parquet). We need to implement the change in main `hive-site.xml`. Doing it.
[08:08:09] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` an-test-worker1003.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2...
[08:22:27] Analytics, Patch-For-Review: Decide to move or not to PrestoSQL/Trino - https://phabricator.wikimedia.org/T266640 (JAllemandou) \o/ I ran a query on the test-cluster, it's all good on my side :) +1!
[08:24:37] Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (GoranSMilovanovic) @elukey Thank you. I think now that we were facing a similar problem already - your comment in T274866#6840079 has just reminded me o...
[08:44:05] hi, I'm thinking about using https://python-poetry.org/ for a project that runs on stat1008 (research/mwaddlink), would it be OK to install poetry on stat1008? it is installed per-user and is isolated from the rest of the system so I assume it would be ok, but thought I'd check
[08:45:24] Hi kostajh - I'll let one of our SREs answer this question when they come online (ottomata or razzi)
[08:46:27] joal: thx!
[08:47:12] kostajh: hi! Just to get more details, did you try with a venv etc..? Usually this is the preferred way for us
[08:48:04] elukey: poetry uses a virtualenv, yes
[08:50:18] https://github.com/python-poetry/poetry#introduction gives an overview, it's to in theory simplify dependency management and avoid some of the traps/pitfalls one encounters with pip + requirements.txt and other config needed with that
[08:50:55] kostajh: is it possible to install poetry via pip first in a venv, and then use it?
[08:51:10] we are also trying to push people to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Anaconda
[08:51:28] no, poetry is the thing that manages the venvs, as I understand it
[08:52:50] ok, good to know re: conda
[08:53:06] kostajh: so I don't see it in the main debian repositories, IIUC we should package and deploy it for this use case, so if possible I'd suggest to either use pip or Anaconda
[08:53:24] ah nice you are going to check, super :)
[08:54:42] elukey: I'm not sure it would need to be in any repos, the installer script downloads python dependencies into your home directory in ~/.poetry/lib and modifies the user PATH to include ~/.poetry/bin. so there is no impact on the rest of the system python
[08:56:16] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-test-worker1003.eqiad.wmnet'] ` and were **ALL** successful.
[08:56:59] kostajh: ah ok ok it is similar to a venv basically, if so it should be ok as test. The main issue that we have currently is that there is no clear boundary of what is good to be installed on the stat100x, since security-wise we (as SREs) don't have any control on the status of packages installed etc..
[08:57:08] but this is true for venvs as well
[08:57:40] so I don't see anything really problematic, buuut if you could use anaconda it would be nicer for us
[08:57:48] (also thanks for reaching out and explaining)
[08:58:52] elukey: ok, I'll have a look... it depends how much more frustrated I get with pip :)
[08:59:01] joal: first (test) Hadoop worker node on Buster!
[08:59:23] elukey: I assume that it works ? :)
[08:59:23] * elukey dances
[08:59:27] \o/
[08:59:29] it does yes :D
[08:59:32] great
[08:59:53] https://wikitech.wikimedia.org/wiki/Blubber/User_Guide has support built in for poetry, so I imagine something would have to be done with conda (we build research/mwaddlink as a docker image for kubernetes deployment and also use the same repo on stat1008 for generating datasets)
[08:59:53] now I am going to try with one on the backup cluster
[09:01:19] kostajh: sure if it is painful to keep things in sync you can definitely try poetry on 1008
[09:03:24] joal: the main annoying thing is that when reimaging (preserving data) there is the chance that the new say hdfs system user will have a different uid, since we don't keep it consistent across the fleet, and files may be incorrectly owned afterwards
[09:03:45] I am wondering if it is a good time to think about standardizing the uid
[09:04:06] wow - this seems problematic elukey - I assume it'll mean change ownership to correct UID after reimage?
[09:04:14] yes exactly
[09:04:18] in the DN dirs
[09:04:24] it is a simple chown hdfs:hdfs -R
[09:04:25] a lot of work :S
[09:04:32] so nothing horrible
[09:04:57] ack - it feels like a long job, but if ok you're the one to know
[09:06:03] joal: I think that for this round of reimage it will be needed, buut if we introduce a fixed uid for hdfs/yarn/etc.. it means that the next upgrade to Bullseye will be easier
[09:06:19] elukey: makes sense
[09:06:59] going to research a bit
[09:07:09] joal: in the meantime, ok if I upgrade Presto?
[09:07:14] please elukey
[09:07:22] all right quick coffee then I'll proceed
[09:40:11] joal: ok so coordinator and presto1001 upgraded
[09:40:27] ok
[09:40:54] superset seems working
[09:41:25] elukey: presto works with different versions from coordinator/workers?
[09:41:40] or does it only use the workers having correct version?
[09:41:40] joal: nono the rest of the workers are down, forgot to say :)
[09:41:45] Ah ok :)
[09:41:48] makes sense
[09:41:51] I upgraded the client on stat1004, do you want to test?
[09:41:54] sure
[09:43:12] elukey: all good for me :)
[09:43:18] elukey: from stat1004
[09:44:46] perfect, proceeding :)
[09:46:48] !log upgrade presto to 0.246-wmf on an-coord1001, an-presto*, stat100x
[09:46:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:50:01] (CR) Awight: [C: -1] "Thank you! The commas are also fixed in https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/656210/6/codemirror/sql/users_" (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664938 (owner: Milimetric)
[09:57:11] aaand we are done!
[09:57:16] \o/
[09:57:23] so presto 226 was released in sept 2019 :P
[09:57:42] I confirm elukey - my query now runs on 5 nodes :)
[09:57:47] gooooooood
[09:57:57] joal: we are ready to test alluxio :)
[09:58:05] This is excellent news :)
[10:00:51] !log restart hive daemons on an-coord1002 (standby coord) to pick up new default parquet file format change
[10:00:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:01:58] ok so the next step would be to failover to an-coord1002, so we can also test if that works
[10:02:01] ok joal ?
[10:02:11] +1
[10:03:08] Analytics, Patch-For-Review: Decide to move or not to PrestoSQL/Trino - https://phabricator.wikimedia.org/T266640 (elukey) Cluster deployed, we are using 0.246 now! This will unblock Alluxio testing.. Still to decide, now that we have a more up to date version.. Trino or PrestoDB?
[10:07:27] !log hive failover to an-coord1002 to apply new hive settings to an-coord1001
[10:07:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:07:42] joal: done!
Let's see what breaks :D
[10:08:42] elukey: all good on my side :)
[10:09:28] elukey: using beeline so that I use the rebooted hive-server, checked param value ok, created a test table and validated the fileformat is ok
[10:10:21] niceeee
[10:13:51] I don't see horrible logs on presto nodes too
[10:15:46] really interesting - https://docs.alluxio.io/os/user/stable/en/compute/Presto.html suggests to deploy Trino :)
[10:15:58] (still called prestosql)
[10:16:01] huhuhu
[10:22:10] Bigtop already provided alluxio 2.4, I might try to build the package and upload it so we use the latest
[10:22:14] rather than 1.6
[10:22:24] https://docs.alluxio.io/os/user/stable/en/deploy/Running-Alluxio-On-a-HA-Cluster.html
[10:23:22] we could think about having two Alluxio masters on the coords
[10:23:35] but we have a lot of things already in there :(
[10:24:10] anyway, will do some reading, looks very interesting
[10:25:22] elukey: coords metrics are not bad - not a lot of usage, and still some RAM available (used by cache now)
[10:31:24] joal: I think that next FI we'll need to refresh hw, in case if the two coords model is fine we could just buy nodes with 128/256G of ram and be fine for a while :)
[10:31:42] works for me elukey
[10:31:56] elukey: I wonder if we should consider bumping presto
[10:32:30] joal: moar nodes?
[10:32:36] elukey: with multi-layer of storage (RAM, SSDs, HDDs), and possibly having a bit more of them
[10:33:51] elukey: maybe it's too soon for that --^
[10:34:04] joal: need to check how old those nodes are, but what I'd do is to repurpose the 5 an-presto workers as hadoop workers (due to the disks that we don't use) and buy more appropriate hw
[10:34:24] yes
[10:34:50] like NVMe SSDs
[10:34:57] and a ton of ram
[10:35:00] :)
[10:35:07] would be expensive
[10:35:50] for ram not so much, we can get hosts with 256/512G of ram easily..
NVMe will cost more yes, but we don't need terabytes of space
[10:36:28] elukey: I wonder if we'd better have bigger old-school SSDs or smaller faster ones
[10:36:33] * joal doesn't know
[10:36:41] I mean having 5x512G of ram is probably also enough for our use cases
[10:36:49] :)
[10:36:58] with Alluxio I mean
[10:37:14] no idea how much NVMe differs from a regular SSD
[10:37:16] you're probably right :)
[10:37:18] price wise
[10:37:43] but we wouldn't need to get a lot of those to have raid10 final capacity around TBs
[10:38:08] if they break we don't have caching on disks, acceptable in theory
[10:38:33] elukey: if they break we have underlying layer of HDDs :)
[10:39:18] joal: we can have the root partition on a couple of disks, then 2/4 other SSDs in jbod or similar, options are really a lot
[10:39:39] the new misc nodes that dcops suggests to buy have all SSDs
[10:39:46] ack
[10:39:47] (maybe not top tier ones of course :D)
[10:40:40] elukey: Given that most of the data presto reads is immutable (might change), having some capacity is also of interest - but I'm ahead of schedule here :)
[10:41:20] joal: better to be because the hw will be kept for 5y!
[10:41:29] yeah
[10:46:15] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) The reimage of an-test-worker1003 (preserving the `/srv/hadoop` dir) went fine! Bigtop works fine on Buster, and the host was re-added to HDFS nicely. One problem keeps re-occurring...
[10:55:00] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (MoritzMuehlenhoff) >>! In T231067#6840556, @elukey wrote: > The reimage of an-test-worker1003 (preserving the `/srv/hadoop` dir) went fine! Bigtop works fine on Buster, and the host was re-a...
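Editor's sketch for the uid problem discussed above (a reimage that preserves data can leave files owned by a uid that no longer maps to `hdfs`). It is only an illustration, not the team's tooling: the function name `find_misowned` is hypothetical, and it just reports paths whose owner differs from an expected uid so the scope of a later `chown hdfs:hdfs -R` can be judged first.

```python
import os


def find_misowned(root, expected_uid):
    """Return paths under root whose owner uid differs from expected_uid.

    Hypothetical helper: it only reads ownership (lstat, so symlinks are
    not followed) and never changes anything on disk.
    """
    misowned = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself plus every file directly inside it.
        candidates = [dirpath] + [os.path.join(dirpath, n) for n in filenames]
        for path in candidates:
            if os.lstat(path).st_uid != expected_uid:
                misowned.append(path)
    return misowned


# Example call (assumes an `hdfs` user exists and the data lives under
# /srv/hadoop, the directory mentioned in the task above):
#   import pwd
#   stale = find_misowned("/srv/hadoop", pwd.getpwnam("hdfs").pw_uid)
#   print(len(stale), "paths need a chown")
```

An empty result would suggest the uid happened to stay the same across the reimage and no recursive chown is needed.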
[11:28:10] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) @razzi this is an interesting problem, I am going to add some context in here :) At the moment we rely on Bigtop deb packages for the creation of users like `hdfs` `yarn` etc.. When...
[11:31:51] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) Filippo documented a similar issue for Swift in https://phabricator.wikimedia.org/T123918, where he ended up adding a special use case to late_command.sh (the script used after the e...
[11:32:56] !log restart hive daemons on an-coord1001 to pick up new parquet settings
[11:32:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:33:09] joal: if you are ok I'll do the failover to an-coord1001
[11:35:19] I have the change ready, will do it after lunch :)
[12:13:14] taking a break
[13:10:30] !log failover analytics-hive to an-coord1001 after maintenance (DNS change)
[13:10:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:12:57] all good on an-coord side elukey - beeline operational
[13:15:00] thanks for checking :)
[13:16:11] Analytics, Analytics-Kanban: Default hive table creation to parquet - needs hive 2.3.0 - https://phabricator.wikimedia.org/T168554 (elukey) p:Triage→Medium a:JAllemandou
[13:51:08] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (Ottomata) If it isn't hard it couldn't hurt! Although, it is a go library, which I'm not sure we have much tooling around dealing with. I think maybe @ema has...
[13:51:53] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) I am trying to figure out if there is a quick way to do this in puppet, but the main problem is that if we try to declare a user with a specific uid/gid in puppet then puppet will ov...
[14:00:07] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (Ottomata) > the main problem is that if we try to declare a user with a specific uid/gid in puppet then puppet will override it, if already present, during the first puppet run. If we are r...
[14:03:05] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (Ottomata) Hm I guess we'd have to deal with the existent users on the already on buster client nodes.
[14:03:35] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) @Ottomata this may work, even if we have some hosts already on Buster (say stat100x, an-launcher, etc.. but we can fix those manually in theory). In order to use the require => User...
[14:04:46] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (Ottomata) > anyway, so possibly in the stretch use case we should just create the user without fixed uid/gid? Oh ya that could work. Alternatively we could just do `user { 'hdfs': ..., be...
[14:06:01] * elukey bbiab!
[14:12:47] ottomata: hey.. uhm.
[14:12:57] hiya!
[14:13:03] So I was looking at my ATSKafka/VRNKafka thing again.
[14:13:26] And in the last 5m-10m, the weird imbalance between ATS and Varnish disappeared.
[14:13:36] hey congratulations!
[14:13:42] Weeeell.
[14:13:48] a) I don't know why
[14:14:13] b) now ATS shows more than Varnish does, though not quite as extremely as the old imbalance
[14:14:26] $ wc -l *.txt
[14:14:28] 227448 atsdump.txt
[14:14:28] it was the other way around before?
[14:14:30] 176251 vrndump.txt
[14:15:09] Hang on. No. the imbalance was the same direction, but closer to 2x
[14:15:32] The numbers there are 1 msg/line, over 1m runtime
[14:15:44] I'll run it for 5m and see if the imbalance veers further.
[14:16:57] klausman: i doubt this is related, but maybe?
[14:16:57] https://phabricator.wikimedia.org/T244843#6840321
[14:16:59] just saw it roll by
[14:17:01] (Abandoned) Milimetric: Fix syntax error [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664938 (owner: Milimetric)
[14:17:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/665089/
[14:17:08] I am seeing close to 3400m/s on ATS, 2900m/s on Varnish
[14:18:03] The requests I see in both topics *look* valid
[14:18:07] aye ok
[14:21:17] 1203264 atsdump.txt
[14:21:19] 841057 vrndump.txt
[14:21:28] Not 2x, but not trivial either :-/
[14:31:29] hey there! how are spark metrics collected on the cluster? Do we have any integration with prometheus/grafana or something similar?
[14:31:31] I'd like to extract some counters for a job, can I safely configure a CSV sink?
[14:31:53] apologies if this is FAQ, but I could not find this info on wikitech :(
[14:35:01] Analytics-Radar, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (elukey) Resolved→Open @Cmjohnson sorry if it took me so long to answer but I noticed this updated only now. The two disks that I have now on an-coord1002 may not b...
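Editor's sketch: the ATS/Varnish imbalance above can be put into numbers from the two `wc -l` samples. The line counts are the ones pasted in the chat (1 message per line); the window lengths of exactly 60 s and 300 s are an assumption, since the chat only says "1m" and "5m", and the helper name `compare` is made up for this example.

```python
def compare(ats_lines, varnish_lines, seconds):
    """Return per-second rates and the ATS/Varnish ratio for one sample."""
    return {
        "ats_per_s": ats_lines / seconds,
        "varnish_per_s": varnish_lines / seconds,
        "ratio": ats_lines / varnish_lines,
    }


# Counts from the chat; durations are assumed to be exact.
one_min = compare(227448, 176251, 60)
five_min = compare(1203264, 841057, 300)

for label, result in (("1m", one_min), ("5m", five_min)):
    print(
        f"{label}: ATS {result['ats_per_s']:.0f}/s, "
        f"Varnish {result['varnish_per_s']:.0f}/s, "
        f"ratio {result['ratio']:.2f}"
    )
```

Under those assumptions the ratio moves from about 1.29 in the short sample to about 1.43 in the longer one, which matches the "not 2x, but not trivial either" reading.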
[14:58:11] (CR) Thiemo Kreuz (WMDE): [C: +1] Add TemplateDataEditor schema to analytics/legacy (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664801 (https://phabricator.wikimedia.org/T275012) (owner: Awight)
[14:58:59] (CR) Thiemo Kreuz (WMDE): [C: +1] Add TwoColConflictExit schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664803 (https://phabricator.wikimedia.org/T275014) (owner: Awight)
[15:00:00] (CR) Thiemo Kreuz (WMDE): [C: +1] Add TwoColConflictConflict schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664802 (https://phabricator.wikimedia.org/T275013) (owner: Awight)
[15:00:53] (CR) Thiemo Kreuz (WMDE): [C: +1] Add TemplateDataApi schema to analytics/legacy (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664799 (https://phabricator.wikimedia.org/T275011) (owner: Awight)
[15:01:28] (CR) Thiemo Kreuz (WMDE): [C: +1] Add ReferencePreviewsPopups schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664798 (https://phabricator.wikimedia.org/T275009) (owner: Awight)
[15:01:50] (CR) Thiemo Kreuz (WMDE): [C: +1] Add ReferencePreviewsBaseline schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664795 (https://phabricator.wikimedia.org/T275007) (owner: Awight)
[15:02:01] (CR) Thiemo Kreuz (WMDE): [C: +1] Add ReferencePreviewsCite schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664796 (https://phabricator.wikimedia.org/T275008) (owner: Awight)
[15:03:12] (CR) Thiemo Kreuz (WMDE): Add VisualEditorTemplateDialogUse schema to analytics/legacy (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664804 (owner: Awight)
[15:03:55] (CR) Thiemo Kreuz (WMDE): [C: +1] Add CodeMirrorUsage schema to analytics/legacy [schemas/event/secondary] -
https://gerrit.wikimedia.org/r/664792 (https://phabricator.wikimedia.org/T275005) (owner: Awight)
[15:04:52] Analytics-Radar, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (elukey) I am going to attempt to add the new disk to the existing md array, let's see how it goes :)
[15:05:24] elukey: if you have any mdadm-related questions, lmk. I've been wrangling that beast for a while
[15:09:35] klausman: sure thanks! So I have a sw raid 1 on an-coord1002, and one of the disks failed. The host is OOW and Chris replaced it with another disk, bigger in size and with a different sector size. Now for RAID1 I think it is fine overall to add the disk in (extra space will not be used, but striping etc.. shouldn't be a concern space wise). Also the logical sector size is 512 for both disks
[15:09:41] in fdisk, in theory it should be fine but never really done it
[15:10:05] (CR) Thiemo Kreuz (WMDE): Use edit count bucket sent by TemplateWizard (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: Awight)
[15:10:12] Yeah, that should just work.
[15:10:26] perfect, will try later to break it :D
[15:10:50] The physical sector size will make the performance a bit skewed, but nothing that breaks stuff
[15:11:09] yes I was thinking the same
[15:11:14] (CR) Thiemo Kreuz (WMDE): Use edit count bucket sent by TemplateWizard (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: Awight)
[15:11:17] thanks for the brainbounce :)
[15:13:19] Is the broken disk still visible?
[15:13:47] (CR) Thiemo Kreuz (WMDE): [C: +1] Use the edit count bucket sent by TemplateData (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/659227 (https://phabricator.wikimedia.org/T272569) (owner: Andrew-WMDE)
[15:14:30] if so, this command should DTRT: mdadm --manage /dev/mdX --replace /dev/broken --with /dev/new
[15:15:42] nope it is not, I thought to use sfdisk to recreate the partitions on the new disk
[15:15:45] and then add
[15:15:56] yeah, that should work just fine
[15:18:18] (CR) Thiemo Kreuz (WMDE): [C: +1] Segment CodeMirror metrics by user edit count (2 comments) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/656210 (https://phabricator.wikimedia.org/T273471) (owner: Awight)
[15:29:43] Hi gmodena - we don't have integration for spark jobs metrics onto prometheus - CSV sink is easiest IMO :)
[15:39:42] joal roger that!
[15:40:28] gmodena: you can also have fun with that if needed: https://github.com/criteo/babar
[15:40:40] gmodena: I used it, it's cool
[15:41:21] joal that looks neat
[15:41:27] i did not know it, thanks for the pointer
[15:41:47] gmodena: it can help for memory stuff, not really for applicative counters
[15:48:06] !log stop hive/mysql on an-coord1002 as precautionary step to rebuild the md array
[15:48:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:49:27] it will take a couple of hours probably
[16:29:06] Make sure you're not capping the rebuild speed.
I think the default max value is 200M/s
[16:29:17] Depending on the disks, you might hit that
[16:29:36] (I've definitely hit that with SSDs, but I suspect we're talking rust here)
[16:39:29] (CR) Awight: Use edit count bucket sent by TemplateWizard (2 comments) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: Awight)
[16:41:22] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) an-worker11(29|33|34|39|40|41): [x] idrac firmware updated [x] bios firmware updated [x] idrac and bios settings & password upd...
[16:41:51] (CR) Awight: Use the edit count bucket sent by TemplateData (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/659227 (https://phabricator.wikimedia.org/T272569) (owner: Andrew-WMDE)
[16:46:24] (CR) Thiemo Kreuz (WMDE): [C: +1] Use the edit count bucket sent by TemplateData (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/659227 (https://phabricator.wikimedia.org/T272569) (owner: Andrew-WMDE)
[16:48:06] Analytics, Event-Platform: Schema tests should validate examples - https://phabricator.wikimedia.org/T275143 (awight)
[16:48:25] (CR) Thiemo Kreuz (WMDE): Use edit count bucket sent by TemplateWizard (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: Awight)
[16:48:36] (CR) Awight: Add VisualEditorTemplateDialogUse schema to analytics/legacy (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664804 (owner: Awight)
[16:50:22] a-team: andrew and I will be about 5 min late to standup
[16:52:49] (CR) Awight: Use edit count bucket sent by TemplateWizard (1 comment) [analytics/reportupdater-queries] -
https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: Awight)
[16:54:13] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1129.eqiad.wmnet',...
[17:00:40] (PS3) Awight: Add TemplateDataApi schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664799 (https://phabricator.wikimedia.org/T275011)
[17:00:45] (CR) Awight: Add TemplateDataApi schema to analytics/legacy (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664799 (https://phabricator.wikimedia.org/T275011) (owner: Awight)
[17:01:42] (CR) jerkins-bot: [V: -1] Add TemplateDataApi schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664799 (https://phabricator.wikimedia.org/T275011) (owner: Awight)
[17:01:44] milimetric: standup?
[17:10:03] (PS3) Awight: Add TemplateDataEditor schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664801 (https://phabricator.wikimedia.org/T275012)
[17:10:05] (CR) Awight: Add TemplateDataEditor schema to analytics/legacy (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664801 (https://phabricator.wikimedia.org/T275012) (owner: Awight)
[17:12:43] (CR) Milimetric: [C: +2] Fix unit tests that ensure certain requests fail and clean up all unit tests [analytics/aqs] - https://gerrit.wikimedia.org/r/662821 (https://phabricator.wikimedia.org/T273404) (owner: Lex Nasser)
[17:14:48] (Merged) jenkins-bot: Fix unit tests that ensure certain requests fail and clean up all unit tests [analytics/aqs] - https://gerrit.wikimedia.org/r/662821 (https://phabricator.wikimedia.org/T273404) (owner: Lex Nasser)
[17:15:59] (PS4) Awight: Add TemplateDataApi schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664799 (https://phabricator.wikimedia.org/T275011)
[17:18:17] (PS1) Razzi: Upgrade superset to 1.0.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/665130
[17:21:37] Does the normalized title here include the namespace? (and, if so in "canonical" English or localized?)
[17:21:40] https://schema.wikimedia.org/repositories/primary/jsonschema/fragment/mediawiki/page/common/current.yaml
[17:26:40] I'm assuming no namespace, specifically because of this localization question being so nicely avoided by having the page_namespace ID in that same schema.
[17:27:44] !log an-coord1002 back in service with raid1 configured
[17:27:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:29:11] Analytics-Radar, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (elukey) Open→Resolved It seems to have worked, thanks!
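Editor's sketch for the rebuild-speed remark made during the RAID1 resync above. It reads the md throttle from `/proc/sys/dev/raid/speed_limit_max` (a value in KiB/s, as far as I know) and estimates a lower bound on resync time. The 200000 fallback mirrors the "200M/s" default quoted in the chat; the 1 TiB member size and both function names are purely hypothetical, since the log does not give the disk size.

```python
from pathlib import Path

# Fallback taken from the "200M/s" default mentioned in the chat (assumed KiB/s).
ASSUMED_DEFAULT_KIB_S = 200_000


def read_speed_limit_max(path="/proc/sys/dev/raid/speed_limit_max"):
    """Return the md resync speed cap in KiB/s, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def rebuild_hours(size_gib, speed_kib_s):
    """Best-case hours to resync size_gib GiB at a steady speed_kib_s KiB/s."""
    size_kib = size_gib * 1024 * 1024
    return size_kib / speed_kib_s / 3600


limit = read_speed_limit_max() or ASSUMED_DEFAULT_KIB_S
# 1024 GiB is an illustrative member size, not the real an-coord1002 disk.
print(f"cap {limit} KiB/s -> at least {rebuild_hours(1024, limit):.2f} h for 1 TiB")
```

Spinning disks usually sync well below the cap, so the real duration tends to be longer than this bound, in line with the "couple of hours" guess in the log.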
[17:32:10] (CR) Awight: Add TemplateDataEditor schema to analytics/legacy (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664801 (https://phabricator.wikimedia.org/T275012) (owner: Awight)
[17:38:08] Analytics: Upgrade Matomo to latest upstream - https://phabricator.wikimedia.org/T275144 (elukey)
[17:40:28] joal: we are close to get the new AQS nodes \o/
[17:40:33] \o/
[17:40:53] Just out of curiosity, is there a place I can find metrics on request counts/rates for each AQS endpoint over time?
[17:41:29] lexnasser: https://grafana.wikimedia.org/d/000000526/aqs?orgId=1 should have some info
[17:43:49] elukey: yeah, I saw that, but I was wondering if I could find data on individual endpoints, like editors per-country for example
[17:44:13] or is there a hive table that holds aqs request data?
[17:45:45] lexnasser: indeed there is :)
[17:45:52] there might be metrics that we currently don't display
[17:45:54] lexnasser: wmf.aqs_hourly
[17:46:16] lexnasser: there is some work to be done to split endpoints, but it's all there
[17:48:14] joal, elukey: thanks! I'll check out wmf.aqs_hourly
[17:55:54] Analytics: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (razzi) I tried to deploy superset to the staging box, but it failed with `aiohttp-3.7.3-cp37-cp37m-manylinux2014_x86_64.whl is not a supported wheel on this platform`. Furthermore the rollback failed with ` Rollback all deplo...
[18:00:41] (CR) Thiemo Kreuz (WMDE): [C: -1] "So it needs fixing? Setting a -1 to make this visible." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/664804 (owner: Awight)
[18:04:35] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1129.eqiad.wmnet', 'an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker11...
[18:27:02] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` an-worker1129.eqiad.wmnet `...
[18:27:08] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1129.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1129....
[18:27:23] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` an-worker1129.eqiad.wmnet `...
[18:30:47] (CR) Milimetric: [C: +1] "This looks great to me. Test it, get that WIP out of the commit message, and I'll take another look and merge." [analytics/refinery] - https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: Bmansurov)
[18:36:26] BTW, Perseverance lands on Mars today!!!!!
[18:36:27] https://www.youtube.com/watch?v=gm0b_ijaYMQ
[18:38:25] !log rebalance kafka partition for webrequest_upload partition 1
[18:38:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:00:37] * elukey afk!
[19:28:01] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1129.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1129....
[19:28:35] milimetric: look at that: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/SearchEngineClassifier.java
[19:29:36] milimetric: the UDF exists - it's about updating the regexes
[19:41:11] Gone for tonight - see y'all
[19:42:14] byeeee
[19:47:20] Actually - my latest test succeeded - I have a working version of Gobblin with HDFS!
[19:47:57] * joal goes to dinner dancing in happiness :)
[19:48:10] woohoooo!
[19:53:36] joal OWOOHOOO
[20:17:47] Analytics-EventLogging, Analytics-Radar, MediaWiki-extensions-CollaborationKit: Decide on JSON validation library - https://phabricator.wikimedia.org/T147137 (Ostrzyciel)
[20:41:55] Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: eventgate-wikimedia should emit metrics about validation errors - https://phabricator.wikimedia.org/T257237 (Ottomata) Great! https://icinga.wikimedia.org/icinga/
[20:42:48] Analytics, Analytics-Kanban, Event-Platform: Rematerialize all event schemas with enforceeNumericBounds - https://phabricator.wikimedia.org/T273069 (Ottomata) https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/661959
[20:53:29] (PS2) Razzi: Upgrade superset to 1.0.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/665130
[21:16:31] (PS3) Razzi: Upgrade superset to 1.0.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/665130
[21:21:36] razzi: o/
[21:21:53] Analytics, Analytics-Kanban, observability, Patch-For-Review: Modify Kafka max replica lag alert to only alert if increasing - https://phabricator.wikimedia.org/T273702 (Ottomata) @razzi ok FYI I've got this alert going now: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafk...
[21:24:27] Analytics, Analytics-Kanban, observability, Patch-For-Review: Modify Kafka max replica lag alert to only alert if increasing - https://phabricator.wikimedia.org/T273702 (Ottomata) Also, FYI @herron and @colewhite since this will apply to the Kafka main and logging clusters too.
[21:24:41] Analytics, Analytics-Kanban, observability, Patch-For-Review: Modify Kafka max replica lag alert to only alert if increasing - https://phabricator.wikimedia.org/T273702 (Ottomata) a:Ottomata
[21:26:02] hey ottomata, nice work on the new alarm! Are we ready to disable the old ones?
[21:26:20] https://phabricator.wikimedia.org/T273702#6842001
[21:26:21] :)
[21:26:31] let's wait til next week
[21:27:05] razzi: am avail to sync for a bit if you wanna
[21:27:32] ottomata: cool, I'll meet you in bc
[21:27:35] k
[21:37:54] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1133.eqiad.wmnet',...
[21:58:53] Analytics, Analytics-EventLogging, Better Use Of Data, Product-Analytics, and 2 others: Document how ad blockers / tracking blockers interact with EventLogging - https://phabricator.wikimedia.org/T263503 (JKatzWMF)
[22:01:44] (PS4) Razzi: Upgrade superset to 1.0.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/665130
[22:13:41] hey a-team: do we have a test version of a newer Superset available?
[22:14:38] I'd like to mess around with some visualizations
[22:14:47] Nettrom: hm, I haven't heard Lu-ca mention that
[22:15:01] Nettrom: Working on it currently! Not ready yet though
[22:15:09] oh, cool razzi :]
[22:15:43] razzi: cool, I'll wait for announcements then, thanks!
:)
[22:23:09] Analytics, Analytics-Kanban, Better Use Of Data, Patch-For-Review: Create Oozie job for session length - https://phabricator.wikimedia.org/T273116 (mforns) I finished and tested the Oozie job, and seems to be working fine! @Mayakp.wiki and I sync'ed up on data checks and, while Maya is making sur...
[22:35:35] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1133.eqiad.wmnet',...
[22:56:47] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker1140.eqia...
[23:04:28] (CR) Mforns: [V: +2] "Hi! I think this is ready for review." [analytics/refinery] - https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) (owner: Mforns)
[23:09:34] bye team! see you tomorrow
[23:20:17] Analytics: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (razzi) Ok, the problem was that I had upgraded the pip version in the docker container when building the wheels, which made the wheels incompatible with the staging server. I was able to keep the old pip version and build all pa...
[23:21:33] Analytics, Growth-Scaling, Growth-Team, Product-Analytics: Growth: shorten welcome survey retention to 90 days - https://phabricator.wikimedia.org/T275171 (MMiller_WMF)
[23:23:09] Analytics, Growth-Scaling, Growth-Team, Product-Analytics: Growth: update welcome survey aggregation schedule - https://phabricator.wikimedia.org/T275172 (MMiller_WMF)
[23:26:40] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) John, In reviewing the installations from the relocation of an-worker11(29|33|34|39|40|41), I ran into a couple issues: an-wo...