[01:20:12] (SystemdUnitFailed) firing: (10) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:20:12] (SystemdUnitFailed) firing: (10) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:05:42] Data-Engineering-Planning, Data Pipelines (Sprint 12): Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (JAllemandou) This is done :) @Htriedman you can now move your files to `hdfs:///wmf/data/published/datasets/...` and they'll be synchronized to...
[09:05:54] Data-Engineering-Planning, Data Pipelines (Sprint 12): Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (JAllemandou)
[09:06:47] joal: I'm going to go for a refinery deploy now, if that's OK with you. Thanks for merging my pageview allowlist patch yesterday.
[09:07:49] I'm going to be on the lookout for T334493 and I'll investigate if I see it happening.
[09:07:50] T334493: anlytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493
[09:12:23] Ah, I remember. Wednesdays. :)
[09:12:55] !log deploying refinery
[09:12:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:13:05] Hi btullis :)
[09:13:26] Hello. Sorry if I pinged when you were busy :-)
[09:13:37] no no all good :)
[09:16:36] So you're ok for me to proceed? Everything looks ok so far.
[09:18:12] For sure! 
I don't think there is anything wildly particular for this week's deploy
[09:18:42] And as you were saying, the train allows you to investigate our issue - all good :)
[09:19:02] I'm around today if you need me btullis - kids are on holidays, so I'll be on an almost classical schedule :)
[09:19:05] Ack, thanks. Will let you know if anything unusual pops up.
[09:20:12] (SystemdUnitFailed) firing: (10) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:21:16] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for kbdwiktionary - https://phabricator.wikimedia.org/T333270 (Marostegui) Database `_p` created and grants created. This is ready for views creation.
[09:22:50] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for fatwiki - https://phabricator.wikimedia.org/T335018 (Marostegui) Open→Resolved a:Marostegui
[09:23:41] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for ckbwiktionary - https://phabricator.wikimedia.org/T331834 (Marostegui) Open→Resolved a:Marostegui Just checked and all good
[09:24:26] Data-Engineering, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for azwikimedia - https://phabricator.wikimedia.org/T330442 (Marostegui) Open→Resolved a:Marostegui Just checked and all good
[09:24:38] Data-Engineering, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for vewikimedia - https://phabricator.wikimedia.org/T330704 (Marostegui) Open→Resolved a:Marostegui Just checked and all good
[09:24:50] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for guwwikinews - https://phabricator.wikimedia.org/T334408 (Marostegui) Open→Resolved a:Marostegui Just checked 
and all good
[09:25:22] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for kcgwiktionary - https://phabricator.wikimedia.org/T334739 (Marostegui) Open→Resolved a:Marostegui Just checked and all good
[09:25:59] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for kbdwiktionary - https://phabricator.wikimedia.org/T333270 (Marostegui) Open→Resolved a:Marostegui Just checked and all good
[09:28:02] Data-Engineering, Data-Services, cloud-services-team: Drop several views from ptwikisource - https://phabricator.wikimedia.org/T332596 (Kizule) Can someone sort this out? There are still related tables on cloud. `lines=10 MariaDB [ptwikisource_p]> SHOW TABLES; +--------------------------+ | Tables_i...
[09:39:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:49:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:56:19] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[09:56:42] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[09:57:59] Data-Engineering, DBA, Infrastructure-Foundations, 
Machine-Learning-Team, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (klausman)
[09:58:34] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (klausman)
[09:59:01] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[10:28:31] FYI, the `refinery-deploy-to-hdfs` step of the refinery deploy still isn't working. It's related to this: https://phabricator.wikimedia.org/T335354
[10:28:53] I'm investigating solutions and I've added a comment to the ticket, describing how it affects us.
[10:48:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-webrequest.service,produce_canary_events.service,refine_netflow.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:50:12] (SystemdUnitFailed) firing: (13) gobblin-webrequest.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:55:40] !log deploying refinery to hdfs
[10:55:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:00:12] (SystemdUnitFailed) firing: (13) gobblin-webrequest.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:12] (SystemdUnitFailed) firing: (13) gobblin-webrequest.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status 
- https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:12:14] !log restart refine_netflow service on an-launcher1002.
[11:12:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:12:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:17] (SystemdUnitFailed) firing: (13) gobblin-webrequest.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:28:33] btullis: If I understand correctly, the only git command executed in that "post deploy" workflow is the "git log" which validates that the checkout to be pushed to HDFS is up-to-date, right?
[11:28:42] I see two options:
[11:29:07] 1. we provide a wrapper which runs that "git-log" under the analytics-deploy user
[11:29:57] moritzm: I think that there is something in the `refinery-deploy-to-hdfs` script that doesn't like it either. I'll see if I can find something...
[11:30:02] 2. we ship a git::systemconfig which sets safe.directory for /srv/deployment/analytics/refinery
[11:31:43] yeah, that script runs git describe at least
[11:32:57] I'm leaning towards 2. in that case, let me propose a patch in Gerrit
[11:32:59] So I think that (2) is probably best for this.
[11:33:05] Snap!
[11:35:27] Another alternative would be to stop using an-launcher1002 and update the process to use a deployment host instead, but that's a bit more involved.
[11:45:10] btullis: at some point, hopefully not too far in the future, we wish to not have to deploy refinery as often as we do
[11:45:39] btullis: we wish to separate the airflow HQL code from other code, which will probably lead to a reorganization of repos
[11:45:50] Having a dedicated deployment host feels overkill
[11:47:09] joal: Great! 
I'm all in favour. What about adding in a bit of gitops and continuous deployment? :-)
[11:47:23] btullis: Would love that :)
[11:48:20] btullis: it's relevant for some of our code, while for some other we need an HDFS integration (done through airflow)
[11:48:47] This entails a good bit of work though, maybe/hopefully this fiscal year?
[11:49:15] Ack. Count me in.
[12:33:17] Data-Engineering-Planning, API Platform, GraphQL, Pageviews-API: Responses on pageview API should be lighter - https://phabricator.wikimedia.org/T145935 (VirginiaPoundstone)
[12:33:46] Data-Engineering-Planning, API Platform, GraphQL, Pageviews-API: Responses on pageview API should be lighter - https://phabricator.wikimedia.org/T145935 (VirginiaPoundstone)
[12:40:20] btullis: reading https://phabricator.wikimedia.org/T335354#8807059, running refinery-deploy-to-hdfs from deployment server sounds cool, but might not be great because generally the git fat artifacts are not synced to the deployment server
[12:40:33] they are pulled as a post deploy step on each deploy host
[12:46:02] Data-Engineering, Anti-Harassment, Event-Platform Value Stream, Privacy Engineering, and 3 others: Exposing revIDs (nothing more) of deleted/suppressed edits for research to respect their removal - https://phabricator.wikimedia.org/T200559 (Ottomata)
[12:46:15] Data-Engineering, Anti-Harassment, Event-Platform Value Stream, Privacy Engineering, and 3 others: Exposing revIDs (nothing more) of deleted/suppressed edits for research to respect their removal - https://phabricator.wikimedia.org/T200559 (Ottomata) cc @lbowmaker @gmodena
[12:54:34] ottomata: OK, got it. Thanks. So I think that the option (2) suggested by m.oritzm above sounds like the best approach then.
[12:56:36] ya sounds good
[13:16:31] quick followup question, from which host are the steps outlined at https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Deploy/Refinery#How_to_deploy usually done? 
[13:16:41] profile::analytics::refinery is added on 12 hosts in total
[13:17:06] airflow1001/1005, an-coord1001, an-launcher1002 and the stat hosts
[13:17:17] plus unrelated an-test* ones
[13:17:41] so should I add the git config to the profile so that it's available on all of them, or specifically only to an-launcher1002?
[13:17:56] which seems to have been the server from which this was first noticed/reported
[13:18:34] It's usually done from an-launcher1002 for the prod-hadoop cluster and from an-test-coord1001 for the test hadoop cluster, I believe. However, other people have more experience of deploying this than I have.
[13:22:06] I'm tempted to say an-coord100[1-2], an-test-coord1001, an-launcher1002 - I think that this task //could// be run on an-coord100[1-2] as it already has the right keytabs.
[13:23:41] ok, then I'm adding the git config to a separate profile and will add this to the respective roles
[13:24:48] Ack, many thanks.
[14:03:31] Data-Engineering, Event-Platform Value Stream: Upgrade Flink Image to 1.17 - https://phabricator.wikimedia.org/T335408 (Ottomata)
[14:03:47] Data-Engineering, Event-Platform Value Stream (Sprint 12): Upgrade Flink Image to 1.17 - https://phabricator.wikimedia.org/T335408 (Ottomata)
[14:11:51] Data-Engineering, Event-Platform Value Stream (Sprint 12): mediawiki-event-enrichment: issue async requests from ProcessFunction - https://phabricator.wikimedia.org/T332948 (Ottomata)
[14:27:08] ottomata, joal: when debugging the PCC to add the git config (https://puppet-compiler.wmflabs.org/output/912301/1759/) I realised https://github.com/wikimedia/operations-puppet/commit/a9f74b682de91317d8df9785fe9afd6cf321ee73 broke things:
[14:27:32] this renames profile::analytics::hdfs_tools to "class hdfs_tools"
[14:28:11] but profile::analytics::hdfs_tools is still included in profile::analytics::cluster::client
[14:28:20] profile::analytics::hdfs_tools
[15:01:14] (CR) Clare Ming: Creates web schema fragment (1 comment) 
[schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[15:01:29] oh ho
[15:15:12] (SystemdUnitFailed) firing: (10) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:19] moritzm: I'm very sorry - I completely missed that
[15:34:39] moritzm: fixed https://gerrit.wikimedia.org/r/c/operations/puppet/+/912316
[15:35:49] thx
[15:36:01] joal: no worries, easy to miss :-)
[16:20:13] mforns: Would you have a minute for me?
[16:20:22] sure! batcave?
[16:20:26] OMW
[16:26:10] (CR) Snwachukwu: Migrate pageview druid load hql queries to Airflow (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/910520 (https://phabricator.wikimedia.org/T334104) (owner: Snwachukwu)
[16:40:13] Analytics-Radar, Data-Engineering-Icebox, Machine-Learning-Team, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (elukey) The last issue has been fixed in T333009: for k8s nodes we just allow `others` to read the devices. The new ROCm suite has been imported for...
[16:51:54] (PS2) Snwachukwu: Migrate pageview druid load hql queries to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/910520 (https://phabricator.wikimedia.org/T334104)
[16:51:56] PROBLEM - IPMI Sensor Status on aqs2008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:48:12] Hi mforns and xcollazo - would you have a minute now?
[17:48:26] I can
[17:49:59] pinging xcollazo on slack
[17:51:21] joal: do you have a couple minutes to re-review https://gerrit.wikimedia.org/r/c/analytics/refinery/+/910092 please? 
I'd like to deploy today if possible.
[17:55:00] mforns: batcave with xcollazo ?
[17:55:07] ok
[18:03:11] (CR) Joal: "Commented on the first file" [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[18:08:37] oh, mforns - batcave again for the CR?
[18:09:34] heya joal I'm in it
[18:09:53] but I've read your comments, they make total sense. Will change those
[18:11:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[18:31:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[18:34:53] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:35:13] (SystemdUnitFailed) firing: (11) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:59:37] (PS4) Mforns: Migrate unique devices druid loading queries to Airflow/SparkSQL [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096)
[19:04:50] (CR) Mforns: Migrate unique devices druid loading 
queries to Airflow/SparkSQL (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[19:17:52] (PS1) Milimetric: Adapt virtualpageview druid scripts to spark [analytics/refinery] - https://gerrit.wikimedia.org/r/912360 (https://phabricator.wikimedia.org/T334105)
[19:46:16] (PS5) Mforns: Migrate unique devices druid loading queries to Airflow/SparkSQL [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096)
[19:49:48] (CR) Mforns: "For the record: Antoine was concerned that if Hive does not recognize tables created with Spark syntax, HiveToDruid would not be able to r" [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[20:50:31] (PS4) Mforns: Migrate queries for webrequest_sampled_128 to /hql (Airflow/Spark3) [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106)
[20:52:54] (CR) Mforns: "Hey Antoine :]" [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106) (owner: Mforns)
[20:56:41] (PS3) Kimberly Sarabia: Creates web schema fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309)
[20:57:12] (CR) CI reject: [V: -1] Creates web schema fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[21:00:45] (PS5) Mforns: Migrate queries for webrequest_sampled_128 to /hql (Airflow/Spark3) [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106)
[21:39:44] Data-Engineering-Planning, XTools, Chinese-Sites: Run maintain-views on zhwiki, newiki - https://phabricator.wikimedia.org/T334041 (MusikAnimal) @lbowmaker Any chance 
we could get an estimate on when you think this task can be fulfilled? My naive understanding is that it's as simple as running a sing...
[21:45:42] (PS4) Kimberly Sarabia: Creates web schema fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309)
[22:00:54] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:13] (SystemdUnitFailed) firing: (11) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:48:35] Data-Engineering-Icebox: Improve Bot Detection Heuristics - https://phabricator.wikimedia.org/T310846 (Mayakp.wiki) In 2023 pageview data, we are seeing spikes in automated traffic that are now affecting external (search engine) referrer traffic ([[ https://w.wiki/6dbK | chart ]]) {F36963923} We need to i...