Edit2: Thanks all for your responses! I have checked the logs, https://lemmy.nz/comment/6192604, and based on that removed tracker-miner-fs as it’s a search/index tool which I don’t need. No idea why it took over all memory. I’ll also get a WiFi Smartplug as a kill switch. Hopefully that solves it. Thanks again heaps!
I’ve got an HP ProDesk G3 which I’m using as a home server, with Ubuntu installed on it. Earlier this week the services I host on it (Immich & Frigate) stopped. I tried to SSH in, but it just hung after asking for a password. I could ping it, but otherwise it was unresponsive.
I had to force reboot it manually. This is fine, but I’m not always at home.
The chip has Intel vPro as far as I know, which could be an option, but I have no idea how this works. The documentation on the Intel site seems focused on enterprises. I tried to connect with RealVNC which does not work, so I think I’ve got to install/configure something on the server first.
I also asked Bing Chat, but it came up with non-existent packages & commands. Your thoughts are welcome!
/edit: I just found this, which seems to be exactly what I need: https://manpages.ubuntu.com/manpages/focal/en/man7/amt-howto.7.html
What about if you use a smart home power socket?
Yes, very good idea. I’ve got HA on a RPI so that should be easy.
Good luck mate ✌️✌️
Thanks, it’s so awesome to see so many useful replies here! If you are interested, I found some very weird things in the logs :( https://lemmy.nz/comment/6192604
Check if your motherboard has a watchdog function. If the OS can’t ping the watchdog every 5 min or whatever you set it to, the board resets.
This is how we handled camera servers at one of my former jobs: we just set up HP SFF desktops with Windows and the software and turned on the watchdog timer. It always did the trick when power outages or system hang-ups happened.
There’s a tale from long ago where someone set up a CD drive tray so that opening it would tap the reset button on a server.
Nice
Awesome, thanks for the link
Thanks, I’ve got a HP SFF as well. Not 100% sure how to turn it on though from Ubuntu. There’s a software based version: https://manpages.ubuntu.com/manpages/xenial/en/man8/watchdog.8.html
But I guess that’s not the one using the motherboard watchdog function.
You need an app running in the OS and a setting in the BIOS. The app at the OS level gives a heartbeat to the watchdog module on the motherboard. If you miss some heartbeats, the firmware on the motherboard sends the reset command.
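To make the heartbeat idea concrete, here is a minimal Python sketch of what such an OS-level app does on Linux. It assumes the kernel exposes the hardware watchdog as /dev/watchdog and that the script runs as root; the function name, interval, and `beats` parameter are made up for illustration. In practice you would more likely let the watchdog package or systemd do this for you.

```python
import time


def feed_watchdog(device_path="/dev/watchdog", interval_s=10.0,
                  beats=None, sleep=time.sleep):
    """Write a heartbeat to the watchdog device every interval_s seconds.

    beats=None loops forever; a number stops after that many heartbeats.
    Returns the number of heartbeats written.
    """
    written = 0
    # Unbuffered binary mode so every write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as dev:
        while beats is None or written < beats:
            dev.write(b"\0")  # any write resets the hardware countdown
            written += 1
            sleep(interval_s)
        # "Magic close": writing 'V' right before closing asks the driver
        # to disarm the timer, so a clean exit does not trigger a reset
        # (drivers built with "nowayout" ignore this).
        dev.write(b"V")
    return written
```

If this process (or the whole OS) hangs, the writes stop, the countdown runs out, and the board resets. The interval has to be shorter than the timeout configured for the watchdog.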
You can set it in the BIOS, regardless of OS.
This is how you lose data. Hope you have a good backup on a NAS?
No, this is a tool that can be used in a well-designed architecture. Would I do this with a single database server? Probably not. Would I ever run a single database server? Also probably not.
Also, by this point, you’ve probably already kernel panicked or something. There’s not much left that can be saved and you probably needed that backup five minutes before the host came up.
Thanks! That should work.
A unifi power strip on a unifi network so you can control the power switch, and setting the motherboard to auto turn on after power failure. Though this is the nuclear option for restarting the system. Maybe while you’re at it, diagnose why it keeps hanging up on you.
Yeah think I’ll get a standalone WiFi smart plug, not connected to my Home Assistant, as a kill switch. But you’re right, it’s overkill.
I found some weird things in the logs, this goes beyond my knowledge :( See https://lemmy.nz/comment/6192604
“But you’re right, it’s overkill.”
I wouldn’t say that. Sure, it’s not the preferred way of restarting a system, but it is a good backup to have if nothing else works, for example after remotely messing up the network connections.
“edit: I just found this, which seems to be exactly what I need: https://manpages.ubuntu.com/manpages/focal/en/man7/amt-howto.7.html”
Ah yes, Intel’s famous security hole.
Some people stopped buying Intel CPUs after this feature was introduced.
Is AMD safer? or are these people buying something else?
Yeah, it’s called AMD DASH, but it’s available only on select CPUs, unlike Intel’s variant.
ARM I guess, or increasingly RISC-V
well kind of if you count pikvm
Ok, I grabbed a few screen shots for you as well. Here is a site that will link you to MEBx setup that enables AMT: http://h10032.www1.hp.com/ctg/Manual/c03883429
When you power on your ProDesk G3, you can access the MEBx setup by pressing Ctrl+P; they also say F6 or Escape will get you there. Intel AMT runs on a different IP address than the one your OS gets. You can assign it a DHCP or static IP address and set up your admin password. You can then access the portal at http://ipaddress:16992. There should be a way to see what is showing on the screen through KVM-like access, but I use MeshCentral for that, so I couldn’t tell you how to do it without it.
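Once AMT is enabled in MEBx, a quick way to confirm the portal is reachable from another machine is to test the TCP port. This is only a sketch: port 16992 comes from the post above, while the function name and the example address in the note below are placeholders. It only tells you that something is listening, not that the login works.

```python
import socket


def amt_port_open(host, port=16992, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `amt_port_open("192.168.1.50")` would check a hypothetical AMT address on the LAN. Replace it with whatever IP you assigned in MEBx.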
Hopefully, that gives you a start. Feel free to reach back out if you have any questions. Thank you!
Thanks heaps, that’s very useful. Will connect a monitor and keyboard and have a look.
Glad I could help! 😃
Yes, thanks for that. Good point. I checked the logs, and minutes before it crashed I can see the entries below. It seems like either a GPU error or an out-of-memory error. I’ve deleted tracker-miner-fs as I don’t need it. The log also shows a massive list of processes with their memory usage.
Feb 21 17:27:49 hppd600-g3 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:0:00000000
Feb 21 17:32:43 hppd600-g3 kernel: 1305621 total pagecache pages
Feb 21 17:32:43 hppd600-g3 kernel: 16258 pages in swap cache
Feb 21 17:32:43 hppd600-g3 kernel: Free swap = 0kB
Feb 21 17:32:43 hppd600-g3 kernel: Total swap = 1000444kB
Feb 21 17:32:43 hppd600-g3 kernel: 2065206 pages RAM
Feb 21 17:32:43 hppd600-g3 kernel: 0 pages HighMem/MovableOnly
Feb 21 17:32:43 hppd600-g3 kernel: 64196 pages reserved
Feb 21 17:32:43 hppd600-g3 kernel: 0 pages hwpoisoned
Feb 21 17:32:43 hppd600-g3 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-113.slice/user@113.service/background.slice/tracker-miner-fs-3.service,task=t>
Feb 21 17:32:43 hppd600-g3 kernel: Out of memory: Killed process 833 (tracker-miner-f) total-vm:625676kB, anon-rss:3144kB, file-rss:4816kB, shmem-rss:4kB, UID:113 pgtables:280kB oom_score_adj:200
Feb 21 17:32:43 hppd600-g3 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
Maybe investigate why it hung?
That could be a sign of something bigger about to kill it altogether
Yes, thanks for that. Good point. I checked the logs, and minutes before it crashed I can see the entries below. It seems like either a GPU error or an out-of-memory error. No idea what tracker-miner-f is, by the way. The log also shows a massive list of processes with their memory usage.
This goes beyond my knowledge :(
Feb 21 17:27:49 hppd600-g3 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:0:00000000
Feb 21 17:32:43 hppd600-g3 kernel: 1305621 total pagecache pages
Feb 21 17:32:43 hppd600-g3 kernel: 16258 pages in swap cache
Feb 21 17:32:43 hppd600-g3 kernel: Free swap = 0kB
Feb 21 17:32:43 hppd600-g3 kernel: Total swap = 1000444kB
Feb 21 17:32:43 hppd600-g3 kernel: 2065206 pages RAM
Feb 21 17:32:43 hppd600-g3 kernel: 0 pages HighMem/MovableOnly
Feb 21 17:32:43 hppd600-g3 kernel: 64196 pages reserved
Feb 21 17:32:43 hppd600-g3 kernel: 0 pages hwpoisoned
Feb 21 17:32:43 hppd600-g3 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-113.slice/user@113.service/background.slice/tracker-miner-fs-3.service,task=t>
Feb 21 17:32:43 hppd600-g3 kernel: Out of memory: Killed process 833 (tracker-miner-f) total-vm:625676kB, anon-rss:3144kB, file-rss:4816kB, shmem-rss:4kB, UID:113 pgtables:280kB oom_score_adj:200
Feb 21 17:32:43 hppd600-g3 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
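For anyone trying to read a report like this, here is a rough Python sketch that pulls out the headline numbers. It assumes 4 KiB memory pages (the usual x86-64 default, not something the log states), and the function and variable names are made up for illustration.

```python
import re

# Assumption: 4 KiB pages, the usual default on x86-64 (not stated in the log).
PAGE_SIZE_KIB = 4

# Lines copied from the log above, with the timestamp prefix dropped.
SAMPLE = """\
Free swap = 0kB
Total swap = 1000444kB
2065206 pages RAM
Out of memory: Killed process 833 (tracker-miner-f) total-vm:625676kB, anon-rss:3144kB, file-rss:4816kB, shmem-rss:4kB, UID:113 pgtables:280kB oom_score_adj:200
"""


def summarize_oom(log_text):
    """Pull the headline numbers out of a kernel OOM report."""
    ram_pages = int(re.search(r"(\d+) pages RAM", log_text).group(1))
    free_swap_kb = int(re.search(r"Free swap\s*=\s*(\d+)kB", log_text).group(1))
    total_swap_kb = int(re.search(r"Total swap\s*=\s*(\d+)kB", log_text).group(1))
    killed = re.search(
        r"Killed process (\d+) \(([^)]+)\).*?"
        r"anon-rss:(\d+)kB, file-rss:(\d+)kB, shmem-rss:(\d+)kB",
        log_text,
    )
    return {
        "ram_gib": ram_pages * PAGE_SIZE_KIB / 1024 / 1024,
        "swap_used_kb": total_swap_kb - free_swap_kb,
        "swap_total_kb": total_swap_kb,
        "killed_pid": int(killed.group(1)),
        "killed_name": killed.group(2),
        # Resident memory = anonymous + file-backed + shared pages.
        "killed_rss_kb": sum(int(killed.group(i)) for i in (3, 4, 5)),
    }
```

On the lines above this works out to roughly 7.9 GiB of RAM, swap completely used (1000444 kB of 1000444 kB), and a killed process that was only holding 7964 kB resident. That is under 8 MiB, so tracker-miner-fs may simply have been picked as the victim (its oom_score_adj is 200) rather than being the thing that filled the memory. The long per-process list in the full log is the place to look for the real consumer.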
Tracker miner fs generates thumbnails for files iirc. There was a recent vulnerability where malicious files could crash it and execute code just by being on disk. Make sure you haven’t been hit by malware
I’ve uninstalled it, it’s an index/search tool. Don’t need it :D
Usually comes with your DE, sometimes removing it breaks your DE
Yeah, tracker miner sounds dodgy. I’ve only installed Immich & Frigate on the box, and no dodgy repositories. It’s also auto-updating. Will research how to check for malware, thought that was a Windows-only thing :D
I’ve previously had a problem with my server becoming unresponsive when running immich. It’s been a while, but I remember there being some kind of memory leak having to do with immich. It was in their GitHub issues and everything. On my system it would take about a day and a half and then ssh, along with everything else, would become unresponsive. Rebooting would fix it for a day and a half. I stopped running immich and it hasn’t happened since. I suppose you could try using a cron job to restart immich periodically and see if that resolves your problem.
That is good to know! Will keep an eye on memory usage of immich. I really like it, so I’m reluctant to let it go.
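If it helps with keeping an eye on memory, here is a small Python sketch of a check that a cron job could run. It reads MemAvailable from /proc/meminfo; the 512 MiB threshold and all function names are arbitrary choices for illustration, and what to do when memory is low (log it, send a notification, restart Immich with whatever command your install uses) is left out on purpose.

```python
import re


def available_kb(meminfo_text):
    """Return the MemAvailable value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^MemAvailable:\s+(\d+) kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemAvailable line found")
    return int(match.group(1))


def memory_is_low(meminfo_text, min_available_kb=512 * 1024):
    """True when available memory is below the (arbitrary) threshold."""
    return available_kb(meminfo_text) < min_available_kb


def read_meminfo(path="/proc/meminfo"):
    """Read the live values; only works on Linux."""
    with open(path) as f:
        return f.read()
```

A cron job would call `memory_is_low(read_meminfo())` every few minutes and act on a True result. Logging the value over a day or two should also show whether usage creeps up steadily, which is what a leak would look like.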
You could connect an ESP32 to the power and reset switches through opto-isolators or relays. You will have to do a little bit of programming, but you can host a website on the ESP32 that will allow you to operate the switches remotely.
If you want to get a bit fancier, you could connect the UART on the ESP32 to a serial port on the server through a TTL to RS-232 level converter and have a remote serial terminal embedded in the web page too. That won’t do much good if the server is completely locked up though.
Remote KVM. If you are relying on a box that no longer has a network connection, you are SOL and need something that can power cycle the box.
If it hung like that, you probably have some sort of storage issue or high memory consumption pushing the box into swap.
Intel AMT may help you; if you want hardware, then google PiKVM. Raritan also makes a small single-node IP KVM, but it’ll probably cost more.
Thanks! Yeah, it seemed to be an OOM issue, but based on my Kagi queries it seems like an OS issue. It also has an error about the GPU, though. Normal memory usage is more than fine, so perhaps it was a one-time thing. See logs: https://lemmy.nz/comment/6192604
On actual server motherboards (as opposed to repurposed home PCs) there is sometimes a special KVM-like interface (keyboard/video/mouse, not the VM hypervisor) so you can connect to it with VNC and have the equivalent of local access. This is called iDRAC on Dell servers, and other vendors have something similar.
On a home PC, hmm, you might be able to set up some kind of remote power cycle and serial console connection, using a second computer (Raspberry Pi or the like). I’m unfamiliar with Intel AMT that you linked to, but it seems like another idea.
I do remember hearing of a DRAC-like board for PC’s but the name of it escapes me right now.
At the end of the day, if you want a long running server, you probably should host it in a data center, maybe with failover and other HA provisions. Home environments are a pain to set up for that. If your computer goes offline and you can’t reach it, how do you even know that your home isn’t having a power outage? Home ISP’s are flaky too, so maybe you want a backup route over mobile data, etc. Yes you can make workarounds for everything but it amounts to turning your home into a crappy low capacity data center.
PiKVM or a similar device could work for OP - is that what you are thinking of? I’ve used it and it works well.
I think a lot of people who self-host get caught up in the excitement of getting the services up and running and neglect disaster planning, prevention, and recovery (myself included). Either they put it off for later or don’t realize it could be a problem down the road until it happens. We always say not to self host anything you can’t live without, and most take that advice, others don’t. Not saying OP falls in either category, necessarily, just adding on to some of your points.
Self hosting really is the land of compromise where we all have to balance our requirements, budget, time and effort. Personally, I have a little disposable income that I spend on hardware to host non-critical services so I can learn and tinker. It could all go away and all I will have lost is the time and money I put into it, but I gained some knowledge and enjoyment. Needless to say, I don’t have much in the way of backups and monitoring.
PiKVM isn’t the board I was thinking of, but same idea, and maybe even better.
Thanks, but a data center is probably overkill for my needs. I’ve got it power loss protected with a UPS, and that’s more than enough for us. Thanks anyway :)
I have a RPI, but of course that one can hang too. I’ll buy a simple WiFi smart plug, standalone, as a kill switch.
Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:
Fewer Letters    More Letters
HA               Home Assistant automation software ~ High Availability
IP               Internet Protocol
NAS              Network-Attached Storage
RPi              Raspberry Pi brand of SBC
SBC              Single-Board Computer
SSH              Secure Shell for remote terminal access
VNC              Virtual Network Computing for remote desktop access
5 acronyms in this thread; the most compressed thread commented on today has 5 acronyms.
[Thread #533 for this sub, first seen 22nd Feb 2024, 04:35]
Good bot
I’m not in front of my computer atm, but I think I have something that can help you out. I have a 3-node Lenovo thin client cluster whose KVMs I manage using Intel vPro. I even went a step further and run MeshCentral on a VM to centralize my KVM access since I have 3 of them, but that’s another story.
Anyway, I’ll see if I can grab you some URLs in the morning if someone else doesn’t beat me to it or you find it on your own running google queries.
Thanks mate. It was a bit of a rabbit hole: I found stuff about the watchdog package, and you can configure it to use the iTCO_wdt module, but I also read it was blacklisted, and then I just gave up. I posted somewhere else in the thread what led up to the hang. And I think I’ll buy a WiFi smart plug so I can remotely reboot everything, assuming the WiFi still works :D
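On the blacklist question, one way to check is to scan the modprobe configuration for a `blacklist` line naming the module. The Python sketch below assumes the files live in /etc/modprobe.d (the usual place, though distros may also ship defaults elsewhere); the function name is made up for illustration.

```python
from pathlib import Path


def find_blacklist_entries(module, conf_dir="/etc/modprobe.d"):
    """Return (file name, line number) pairs where `module` is blacklisted."""
    # modprobe treats '-' and '_' in module names as the same character.
    wanted = module.replace("-", "_")
    hits = []
    for conf in sorted(Path(conf_dir).glob("*.conf")):
        lines = conf.read_text().splitlines()
        for lineno, line in enumerate(lines, start=1):
            # Drop comments, then look for exactly "blacklist <module>".
            words = line.split("#", 1)[0].split()
            if (len(words) == 2 and words[0] == "blacklist"
                    and words[1].replace("-", "_") == wanted):
                hits.append((conf.name, lineno))
    return hits
```

`find_blacklist_entries("iTCO_wdt")` returning an empty list would mean nothing in that directory blocks the module. Any hits tell you which file and line to comment out if you decide to use the hardware watchdog.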