HC1: Operations SIG 26 Aug 2024
Agenda
What: Meeting to Discuss Improving Node Operations as part of the HC1: Operations SIG
When: 26 Aug 2024 @ 6PM EST
Where: https://matrix.to/#/#sig-operations:hc1.chat
Chair: 0x3639
Agenda:
- Discuss follow-up items from previous meeting
- Document action items
- Establish next meeting
If you want to attend please respond (or DM) with your full matrix username and I will invite you to the group. No FUD, anger or BS allowed.
Pre-meeting Notes
- Added `grafana.sh` https://github.com/go-zenon/go/blob/main/grafana.sh. This automates the installation of Grafana, node_exporter & Prometheus. It creates a default Prometheus datasource, scrapes the node_exporter endpoint, and installs a node_exporter dashboard. Tested on amd64. Need to add arm64 support.
- Started to investigate a custom dashboard for znnd. I created one for Docker previously. It leveraged the JSON API data source for Grafana. However, this plugin is now in maintenance mode and no new features will be added. Grafana recommends using the Infinity data source plugin instead.
- I started to investigate the Infinity data source plugin. It will be used to scrape the api endpoints to report `syncStatus` and other important metrics.
- Next we can consider installing Loki to manage log files. We can discuss at the meeting.
- Opened PR for arm64 support https://github.com/go-zenon/go/pull/9, needs testing
- Added issue https://github.com/go-zenon/go/issues/10
- Made ASCII Art more readable at lower resolutions.
- Added --help flag https://github.com/go-zenon/go/pull/8
- I can test arm64 support, will be spawning an arm VPS for the Supernova testnet.
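For the arm64 items above, a minimal sketch of architecture detection. It assumes release assets are labelled `amd64` and `arm64`; the function name `map_arch` is hypothetical and is not taken from `grafana.sh` or PR #9.

```shell
# Map the kernel's machine name to an architecture label.
# The labels "amd64" and "arm64" are an assumption about asset naming.
map_arch() {
    case "$1" in
        x86_64|amd64)  echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *)
            echo "unsupported architecture: $1" >&2
            return 1
            ;;
    esac
}

# Example: pick the label for the current machine.
# ARCH="$(map_arch "$(uname -m)")" || exit 1
```

Keeping the mapping in one function means the znnd install and the Grafana install could share it instead of each hard-coding amd64.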
Minutes
Mon, Aug 26, 2024, 17:03:11 - deeznnutz: Thx everyone for contributing to the go-zenon bash script. We are making good progress.
Mon, Aug 26, 2024, 17:03:21 - deeznnutz: I merged in @coinselor's PR #8 to improve the ASCII art and add a `--help` flag.
Mon, Aug 26, 2024, 17:03:34 - deeznnutz: those changes were pretty straightforward
Mon, Aug 26, 2024, 17:03:42 - deeznnutz: George submitted the PR for arm64 support. I have not tested it yet. Once we test it we can pull in that change. It's pretty simple. He submitted an issue to make sure the script checks for `apt` and `systemd`. Should we clarify that as a requirement or have the script check for the proper operating system and systemd?
Mon, Aug 26, 2024, 17:04:05 - georgezgeorgez: I think it's fine just to document it for now.
Mon, Aug 26, 2024, 17:04:15 - georgezgeorgez: I think it's okay for us to do 1 deployment target really well first.
Mon, Aug 26, 2024, 17:04:45 - georgezgeorgez: The people who need the most support will probably be choosing ubuntu/deb as their recommended OS.
Mon, Aug 26, 2024, 17:04:51 - deeznnutz: ya, makes sense. Should we check in the script and halt it if apt and systemd are not present?
Mon, Aug 26, 2024, 17:05:25 - georgezgeorgez: We could do that, but not a priority.
Mon, Aug 26, 2024, 17:05:46 - deeznnutz: OK - I can add that as a todo and we can deal with it later.
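A minimal sketch of the deferred check, for when the todo is picked up. Detecting apt via `apt-get` and systemd via `systemctl` is an assumption, and both function names are hypothetical.

```shell
# Print the names of any required commands that are not on PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Halt early when apt or systemd is absent.
# Looking for apt-get and systemctl is an assumption about how to detect them.
require_apt_and_systemd() {
    local missing
    missing="$(missing_commands apt-get systemctl | tr '\n' ' ')"
    if [ -n "$missing" ]; then
        echo "Unsupported system, missing: $missing" >&2
        return 1
    fi
}

# Example: call near the top of the installer.
# require_apt_and_systemd || exit 1
```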
Mon, Aug 26, 2024, 17:05:47 - georgezgeorgez: We should try and get someone to use this script in the wild asap and get information about their nodes via the monitoring
Mon, Aug 26, 2024, 17:06:03 - deeznnutz: I set up a standalone script to automate the installation of Grafana, node_exporter & Prometheus. It creates a default Prometheus datasource, scrapes the node_exporter endpoint, and installs a default node_exporter dashboard. It currently only works on amd64.
Mon, Aug 26, 2024, 17:06:21 - deeznnutz: TODO
- We need to expand functionality to arm64
- Add a custom dashboard for `znnd`. This will require installing the Infinity data plugin and adding a new datasource (you can add an API endpoint as a datasource and it scrapes the API at x interval).
- Potential `znnd` metrics to show:
  - Sync status
  - currentHeight
  - targetHeight
  - version
  - commit
  - numPeers
  - stats.osInfo
  - What else should we include?
Mon, Aug 26, 2024, 17:06:35 - georgezgeorgez: What is the infinity data plugin?
Mon, Aug 26, 2024, 17:06:45 - deeznnutz: it's a plugin that allows curl calls
Mon, Aug 26, 2024, 17:07:03 - deeznnutz: it basically runs them on a schedule and then you can display the data in a dashboard
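The kind of call the plugin would make can be tried by hand first. This is a sketch under assumptions: that znnd answers JSON-RPC 2.0 over HTTP, and that it listens on `http://127.0.0.1:35997` (adjust `ZNND_RPC_URL` if not). `rpc_body` and `rpc_call` are hypothetical helper names; `stats.osInfo` is the method named in the TODO list.

```shell
# Endpoint is an assumption: adjust to wherever znnd serves HTTP JSON-RPC.
ZNND_RPC_URL="${ZNND_RPC_URL:-http://127.0.0.1:35997}"

# Build a JSON-RPC 2.0 request body for a method that takes no parameters.
rpc_body() {
    printf '{"jsonrpc":"2.0","id":1,"method":"%s","params":[]}' "$1"
}

# POST the request and print the raw JSON response.
rpc_call() {
    curl -sS -X POST -H 'Content-Type: application/json' \
        -d "$(rpc_body "$1")" "$ZNND_RPC_URL"
}

# Example (needs a running node):
# rpc_call stats.osInfo
```

The same URL, method, and body would then be entered in the Infinity datasource query, with the dashboard fields picked out of the JSON response.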
Mon, Aug 26, 2024, 17:07:22 - georgezgeorgez: gotcha. That might be the fastest way
Mon, Aug 26, 2024, 17:07:34 - georgezgeorgez: There could be other relatively quick methods like parsing logs
Mon, Aug 26, 2024, 17:07:54 - deeznnutz: previously I used JSON API and it worked great. but that plugin is no longer under development
Mon, Aug 26, 2024, 17:08:04 - coinselor: I think syrius shows quite a few znnd metrics, we could use that as reference
Mon, Aug 26, 2024, 17:08:49 - georgezgeorgez: Long term, I think we should consider building metrics into the node. I think https://opentelemetry.io/ is worth considering, but not really the next step for us
Mon, Aug 26, 2024, 17:09:15 - deeznnutz: that would be awesome.
Mon, Aug 26, 2024, 17:09:34 - georgezgeorgez: In terms of other metrics, what would help us debug a production issue or a testnet failure?
Mon, Aug 26, 2024, 17:09:45 - georgezgeorgez: We might need different dashboards for prod and dev envs
Mon, Aug 26, 2024, 17:10:00 - deeznnutz: We can add Loki the log processor
Mon, Aug 26, 2024, 17:10:15 - deeznnutz: I've tested that before. it can parse all the logs and you can display them any way you want
Mon, Aug 26, 2024, 17:10:48 - georgezgeorgez: Grafana has something called the LGTM stack https://grafana.com/go/webinar/getting-started-with-grafana-lgtm-stack/
Mon, Aug 26, 2024, 17:10:56 - georgezgeorgez: I'm not familiar with Tempo or Mimir
Mon, Aug 26, 2024, 17:11:29 - deeznnutz: cool - I've never seen that before. I can check it out
Mon, Aug 26, 2024, 17:12:34 - georgezgeorgez: These days, tools are being developed so fast it seems. I think we just go with something, relatively modern, and then if there's a big reason to change, we change. A few years ago, ELK stack was pretty popular, but I think less now. And I think it's a bit overkill. If there is a criteria, we should consider how lightweight the stack is.
Mon, Aug 26, 2024, 17:12:52 - georgezgeorgez: Considering that any resources used for the monitoring stack is taking away from znnd in a single node deploy
Mon, Aug 26, 2024, 17:13:41 - deeznnutz: so the next steps are arm support, Infinity data plugin, create znnd dashboard
Mon, Aug 26, 2024, 17:13:44 - georgezgeorgez: I'm not 100% sure how useful log aggregation will be for a single node
Mon, Aug 26, 2024, 17:14:06 - georgezgeorgez: Considering that all the logs will just be on the box itself
Mon, Aug 26, 2024, 17:14:23 - coinselor: Aren't we making the monitoring stack optional when using the script?
Mon, Aug 26, 2024, 17:14:24 - georgezgeorgez: But if it helps people isolate the logs around a certain timeframe/metric spike it could still be useful
Mon, Aug 26, 2024, 17:14:35 - deeznnutz: we could consider a `--send-logs` flag
Mon, Aug 26, 2024, 17:15:13 - georgezgeorgez: (replying to @coinselor: "Aren't we making the monitoring ...") Yes optional, but hopefully it's useful enough where most node operators want to run it. So lightweight is better imo
Mon, Aug 26, 2024, 17:15:24 - deeznnutz: (replying to @coinselor: "Aren't we making the monitoring ...") this was one of my questions. I assumed we would add a flag for `--grafana` to install it separately
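A minimal sketch of how opt-in flags could be parsed. `--help` already exists in the script; `--grafana` and `--send-logs` are only proposals from this meeting, and the variable and function names here are hypothetical.

```shell
# Defaults: nothing optional is installed unless asked for.
INSTALL_GRAFANA=0
SEND_LOGS=0

# Parse the flags floated in this meeting. --grafana and --send-logs are
# proposals, not flags the script is known to implement.
parse_flags() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --grafana)   INSTALL_GRAFANA=1 ;;
            --send-logs) SEND_LOGS=1 ;;
            --help)      echo "usage: $0 [--grafana] [--send-logs]"; return 0 ;;
            *)           echo "unknown flag: $1" >&2; return 2 ;;
        esac
        shift
    done
}

# Example: parse_flags "$@" || exit $?
```

Rejecting unknown flags avoids silently skipping the monitoring install because of a typo.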
Mon, Aug 26, 2024, 17:15:47 - coinselor: I can work on the interactivity of the script. I should be able to look at how the script is installing all the stuff deez is adding and make it interactive so that the user has to choose what to install. Maybe we can make the monitoring stack the (Default) option
Mon, Aug 26, 2024, 17:16:19 - georgezgeorgez: deeznnutz: you are the chair. You run a pillar and nodes. What would actually be useful to you? How can we get feedback about what is important for other operators?
Mon, Aug 26, 2024, 17:16:57 - georgezgeorgez: As chair, you should try and get feedback from users/stakeholders
Mon, Aug 26, 2024, 17:17:13 - georgezgeorgez: Maybe a survey to pillars?
Mon, Aug 26, 2024, 17:17:45 - deeznnutz: ya, makes sense. It would be super helpful to me when troubleshooting stuff if I could get logs and settings when helping someone
Mon, Aug 26, 2024, 17:17:59 - coinselor: I think the survey might be more useful after we have them use the script for the first time, then get their feedback.
Mon, Aug 26, 2024, 17:18:13 - deeznnutz: i always go through a series of questions that are super simple before getting into helping someone.
Mon, Aug 26, 2024, 17:18:35 - georgezgeorgez: nice, that is the basis of the "diagnostics" i talked about
Mon, Aug 26, 2024, 17:18:43 - deeznnutz: but regarding others, I can ask them what would be useful to them as a pillar/operator
Mon, Aug 26, 2024, 17:18:55 - georgezgeorgez: yeah we can do it informally to start
Mon, Aug 26, 2024, 17:19:18 - georgezgeorgez: i just want to make sure we're building stuff with guidance from the actual community
Mon, Aug 26, 2024, 17:19:37 - georgezgeorgez: i mean we're part of the community, but broader feedback
Mon, Aug 26, 2024, 17:19:49 - deeznnutz: what about setting up a producer address like the znn controller does.
Mon, Aug 26, 2024, 17:20:02 - deeznnutz: should we have a `--producer` flag that sets up a producer address?
Mon, Aug 26, 2024, 17:20:20 - georgezgeorgez: i think that is only necessary for pillars
Mon, Aug 26, 2024, 17:20:34 - georgezgeorgez: so if that is our initial target user then yeah we would need it
Mon, Aug 26, 2024, 17:20:42 - georgezgeorgez: but changing the producer also requires changing it on-chain
Mon, Aug 26, 2024, 17:20:53 - georgezgeorgez: some people might want to re-use an existing producer
Mon, Aug 26, 2024, 17:21:10 - georgezgeorgez: maybe that would be considered a bad practice
Mon, Aug 26, 2024, 17:21:29 - deeznnutz: can a producer address be created with the CLI
Mon, Aug 26, 2024, 17:21:45 - deeznnutz: I've never created one before without using the znn-controller-software
Mon, Aug 26, 2024, 17:22:19 - deeznnutz: (replying to @coinselor: "I think the survey might be more...") maybe we do it before and after
Mon, Aug 26, 2024, 17:22:43 - georgezgeorgez: the producer is just a key-pair. The node configuration has to specify the file to use
Mon, Aug 26, 2024, 17:22:51 - deeznnutz: for example I know shai wants better monitoring tools. Would be interesting to get his feedback
Mon, Aug 26, 2024, 17:23:39 - deeznnutz: right, in the config.json
Mon, Aug 26, 2024, 17:24:01 - coinselor: informally asking before sounds good to brainstorm ideas, but I won't be shocked if someone goes 'a tg bot that alerts me about node going down' and similar requests
Mon, Aug 26, 2024, 17:25:26 - georgezgeorgez: Sometimes a user doesn't exactly know what they want 😅. It's up to us to translate requests into underlying problems and solve those. The surface level suggestion sometimes will be and sometimes won't be the best path
Mon, Aug 26, 2024, 17:25:43 - georgezgeorgez: So another target user could be developers
Mon, Aug 26, 2024, 17:25:55 - georgezgeorgez: I created a "devnet" branch of znnd way back
Mon, Aug 26, 2024, 17:26:11 - georgezgeorgez: And it sets up the producer and config necessary for a single node testnet
Mon, Aug 26, 2024, 17:26:39 - georgezgeorgez: It's baked into znnd. And it means that in order to use it, developers have to rebase their changes on top of the branch
Mon, Aug 26, 2024, 17:26:45 - georgezgeorgez: It would be better if creating a devnet was a separate script
Mon, Aug 26, 2024, 17:26:56 - georgezgeorgez: Not tied to a specific branch of go-zenon
Mon, Aug 26, 2024, 17:27:35 - georgezgeorgez: But I think for Operations, we should focus on node operators first
Mon, Aug 26, 2024, 17:27:40 - deeznnutz: So maybe I can start creating issues in GH for this additional functionality.
Mon, Aug 26, 2024, 17:28:32 - georgezgeorgez: Yeah it's no problem to define more work
Mon, Aug 26, 2024, 17:28:56 - georgezgeorgez: We should have a selection of possible things to do and then work with the users/stakeholders to pick what to do next
Mon, Aug 26, 2024, 17:29:02 - deeznnutz: We are talking about
- interactive installation menu
- producer flag
- testnet flag
Mon, Aug 26, 2024, 17:29:26 - georgezgeorgez: Do you have an idea of how a menu would work?
Mon, Aug 26, 2024, 17:29:33 - deeznnutz: in addition to the things mentioned above to integrate znnd monitoring
Mon, Aug 26, 2024, 17:29:42 - georgezgeorgez: I think doing it in bash wouldn't be so pretty
Mon, Aug 26, 2024, 17:29:49 - deeznnutz: (replying to @georgezgeorgez: "Do you have an idea of how a men...") I know how it won't work... lol
Mon, Aug 26, 2024, 17:30:08 - deeznnutz: I tried it and could not get one working with an install command with curl.
Mon, Aug 26, 2024, 17:30:28 - deeznnutz: maybe I just gave up too early
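One common reason prompts fail under `curl ... | bash` is that the script itself arrives on stdin, so `read` consumes script text instead of keystrokes. Whether that was the cause here is not stated. A minimal sketch assuming it was: read answers from the terminal device instead. `ask_yes_no` and `PROMPT_INPUT` are hypothetical names.

```shell
# Ask a yes/no question, defaulting to yes.
# PROMPT_INPUT is an assumption for this sketch; it defaults to /dev/tty so the
# prompt still works when the script itself is piped in on stdin.
ask_yes_no() {
    local question="$1" answer
    local input="${PROMPT_INPUT:-/dev/tty}"
    printf '%s [Y/n] ' "$question" >&2
    # If the input cannot be opened or read, fall back to the default (yes).
    IFS= read -r answer < "$input" || answer=""
    case "$answer" in
        [nN]|[nN][oO]) return 1 ;;
        *)             return 0 ;;
    esac
}

# Example:
# if ask_yes_no "Install the monitoring stack?"; then echo "installing"; fi
```

Defaulting to yes matches the idea of making the monitoring stack the default option while still letting an operator decline.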
Mon, Aug 26, 2024, 17:30:34 - georgezgeorgez: https://github.com/charmbracelet/bubbletea If we do go in the direction of TUIs
Mon, Aug 26, 2024, 17:31:11 - deeznnutz: that would be awesome
Mon, Aug 26, 2024, 17:31:19 - georgezgeorgez: deeznnutz: but again, probably not the near term focus. What do you think we should try and have done before next meeting?
Mon, Aug 26, 2024, 17:31:59 - deeznnutz: my goal is to get the new datasource integrated and a custom znnd dashboard working
Mon, Aug 26, 2024, 17:32:06 - deeznnutz: that's what I can work on.
Mon, Aug 26, 2024, 17:32:21 - georgezgeorgez: Cool that's what I was thinking too. Maybe at the very least, wireframe JSON for the dashboard? And some poc of the infinity plugin for at least one of the graphs
Mon, Aug 26, 2024, 17:32:25 - deeznnutz: and pull in your changes
Mon, Aug 26, 2024, 17:32:55 - deeznnutz: (replying to @georgezgeorgez: "Cool that's what I was thinking ...") yes that is doable
Mon, Aug 26, 2024, 17:33:52 - deeznnutz: The TUI framework would be super cool.
Mon, Aug 26, 2024, 17:33:53 - georgezgeorgez: (replying to @deeznnutz: "and pull in your changes") If we want to actually test it live, we would need to run an arm64 server
Mon, Aug 26, 2024, 17:34:12 - deeznnutz: (replying to @georgezgeorgez: "If we want to actually test it l...") DO does not have them. So I can test on another platform
Mon, Aug 26, 2024, 17:34:14 - georgezgeorgez: Long term, I think it would make sense for us to have some test scripts that interact with cloud provider APIs to spin up nodes etc
Mon, Aug 26, 2024, 17:34:35 - georgezgeorgez: Run some tests, spit out data, and then tear it down
Mon, Aug 26, 2024, 17:34:52 - georgezgeorgez: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
Mon, Aug 26, 2024, 17:35:04 - coinselor: I can test the arm64 changes
Mon, Aug 26, 2024, 17:35:39 - deeznnutz: I need to add arm support for the grafana install too
Mon, Aug 26, 2024, 17:36:00 - deeznnutz: maybe we can all take a look at the TUI framework.
Mon, Aug 26, 2024, 17:36:22 - deeznnutz: between that and everything else going on I think this is doable in the next 2 weeks
Mon, Aug 26, 2024, 17:36:40 - georgezgeorgez: A lot of my devnet branch could be carved out. But let's get a dashboard out first, before we make things pretty
Mon, Aug 26, 2024, 17:36:58 - deeznnutz: cool. sounds like a plan
Mon, Aug 26, 2024, 17:37:04 - georgezgeorgez: I can help with the Infinity plugin and grafana dashboard JSON and whatever else really
Mon, Aug 26, 2024, 17:37:35 - deeznnutz: maybe you can take on the dashboard after I get the plugin installed and setup
Mon, Aug 26, 2024, 17:37:36 - georgezgeorgez: And yeah maybe before the next meeting, we can talk with other pillar operators
Mon, Aug 26, 2024, 17:38:02 - deeznnutz: I have time this week. I'm traveling T-TH next week.
Mon, Aug 26, 2024, 17:38:31 - georgezgeorgez: When do you think we should meet next?
Mon, Aug 26, 2024, 17:38:59 - deeznnutz: Sept 9 @ 6PM EST? does that work?
Mon, Aug 26, 2024, 17:39:32 - georgezgeorgez: Should be good
Mon, Aug 26, 2024, 17:39:41 - georgezgeorgez: coinselor hbu?
Mon, Aug 26, 2024, 17:39:51 - coinselor: Ye that works
Mon, Aug 26, 2024, 17:40:08 - deeznnutz: cool. sounds like a plan. thx everyone!!
Mon, Aug 26, 2024, 17:40:21 - georgezgeorgez: Anything else you want to go over? Or call it for today?
Mon, Aug 26, 2024, 17:40:47 - deeznnutz: I'm good. did you see my post on dynamic fusing?
Mon, Aug 26, 2024, 17:40:54 - deeznnutz: am I retarded?
Mon, Aug 26, 2024, 17:41:11 - georgezgeorgez: Not sure if either question is within scope of the SIG
Mon, Aug 26, 2024, 17:41:16 - georgezgeorgez: haha
Mon, Aug 26, 2024, 17:41:19 - deeznnutz: lol
Mon, Aug 26, 2024, 17:41:26 - deeznnutz: ya, we can chat about that elsewhere
Mon, Aug 26, 2024, 17:41:44 - georgezgeorgez: Thank you everyone. Thank you deeznnutz for facilitating as chair