记一次Nginx异常重启失败

一大早收到几个服务同时不可用的告警,一查又是 Nginx 出现启动失败,手动重启后恢复正常,这个情况过去的一年里,也出了一次,同样的时间,于是好好排查下。直接说结论,两个原因:一是系统底层更新,引发多服务重启,包括 nginx 在内。二是 nginx 一个 upstream 用了公网域名,解析失败,导致启动失败。

下面是过程,首先看 Nginx 日志,精简后如下:

[emerg] host not found in upstream "webapi.xxx.com" in /etc/nginx/sites-enabled/com.xxx.conf
nginx: configuration file /etc/nginx/nginx.conf test failed

Nignx 相关的配置,很早就没动过,比较迷惑的是手动执行 nginx -t 却都是 ok 的,唯独这次事故中出现了 test failed,仔细一看,上多了一条 host not found,或许刚好那个点就解析失败了,因为是遗留配置,没有使用,直接删除。

接下来就好玩了,哪个“缺德” Job 会在凌晨6点重启 Nginx?甚至这个时间点重启了好几个重要服务。印象中也就是 acme 刷新证书配置了 reload,一查时间对不上,crontab中也没有 Job 能去操作 Nginx。没有头绪的时候问了下 GPT,给出了一个关键点:系统定时服务,果然就有个时间很吻合,结果如下:

~# systemctl list-timers --all

LAST 2026-02-04 06:03:03 CST
UNIT apt-daily-upgrade.timer
ACTIVATES apt-daily-upgrade.service

看看这个 apt-daily-upgrade 究竟干了什么:

~# journalctl -u apt-daily-upgrade.service --since "2026-02-04 06:00:00" --until "2026-02-04 06:06:00" --no-pager -o short-iso

systemd[1]: Starting apt-daily-upgrade.service - Daily apt upgrade and clean activities...
apt.systemd.daily[4037440]: /usr/bin/unattended-upgrade:567: DeprecationWarning: This process (pid=4037440) is multi-threaded, use of fork() may lead to deadlocks in the child.

日志不多,但明显去了 unattended-upgrade,继续:

~# sudo grep -n "2026-02-04 06:0" /var/log/unattended-upgrades/unattended-upgrades.log | head -n 200

Starting unattended upgrades script
47:2026-02-04 06:03:13,556 INFO Packages that will be upgraded: inetutils-telnet libc-bin libc-dev-bin libc-devtools libc6 libc6-dev libpng16-16t64 locales mysql-client-core-8.0 telnet
49:2026-02-04 06:04:28,538 INFO All upgrades installed

升级了几个底层库,这时候意识到,应该就是这里引发了上层重启,继续:

~# sudo sed -n '1,200p' /var/log/unattended-upgrades/unattended-upgrades-dpkg.log

Restarting services...
systemctl restart aegis.service aliyun.service chrony.service containerd.service cron.service fwupd.service ilogtaild.service multipathd.service mysql.server.service nginx.service php8.3-fpm.service polkit.service redis-server.service rsyslog.service ssh.service systemd-journald.service systemd-networkd.service systemd-resolved.service systemd-udevd.service tuned.service udisks2.service uuidd.service
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xeu nginx.service" for details.

好家伙,重启了这么多服务。

这次事故里,“触发器”是自动升级导致的服务重启;“根因”是 Nginx 配置模式的脆弱性,upstream 配置在启动期就要求域名可解析,DNS/解析链路只要在重启瞬间抖动一下,Nginx 就起不来,于是任何一次重启(自动升级、人工重启、机器迁移、网络切换)都可能把隐患放大成故障。

本文链接地址:https://dorole.com/2302/

来自:Dorole's Blog

发布者

Steve

编程/摄影

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注

:wink: :-| :-x :twisted: :) 8-O :( :roll: :-P :oops: :-o :mrgreen: :lol: :idea: :-D :evil: :cry: 8) :arrow: :-? :?: :!: