记一次Nginx异常重启失败

一大早收到几个服务同时不可用的告警，一查又是 Nginx 出现启动失败，手动重启后恢复正常，这个情况过去的一年里，也出了一次，同样的时间，于是好好排查下。直接说结论，两个原因：一是系统底层更新，引发多服务重启，包括 nginx 在内。二是 nginx 一个 upstream 用了公网域名，解析失败，导致启动失败。

下面是过程，首先看 Nginx 日志，精简后如下：

[emerg] host not found in upstream "webapi.xxx.com" in /etc/nginx/sites-enabled/com.xxx.conf
nginx: configuration file /etc/nginx/nginx.conf test failed

Nignx 相关的配置，很早就没动过，比较迷惑的是手动执行 nginx -t 却都是 ok 的，唯独这次事故中出现了 test failed，仔细一看，上多了一条 host not found，或许刚好那个点就解析失败了，因为是遗留配置，没有使用，直接删除。

接下来就好玩了，哪个“缺德” Job 会在凌晨6点重启 Nginx？甚至这个时间点重启了好几个重要服务。印象中也就是 acme 刷新证书配置了 reload，一查时间对不上，crontab中也没有 Job 能去操作 Nginx。没有头绪的时候问了下 GPT，给出了一个关键点：系统定时服务，果然就有个时间很吻合，结果如下：

~# systemctl list-timers --all

LAST 2026-02-04 06:03:03 CST
UNIT apt-daily-upgrade.timer
ACTIVATES apt-daily-upgrade.service

看看这个 apt-daily-upgrade 究竟干了什么：

~# journalctl -u apt-daily-upgrade.service --since "2026-02-04 06:00:00" --until "2026-02-04 06:06:00" --no-pager -o short-iso

systemd[1]: Starting apt-daily-upgrade.service - Daily apt upgrade and clean activities...
apt.systemd.daily[4037440]: /usr/bin/unattended-upgrade:567: DeprecationWarning: This process (pid=4037440) is multi-threaded, use of fork() may lead to deadlocks in the child.

日志不多，但明显去了 unattended-upgrade，继续：

~# sudo grep -n "2026-02-04 06:0" /var/log/unattended-upgrades/unattended-upgrades.log | head -n 200

Starting unattended upgrades script
47:2026-02-04 06:03:13,556 INFO Packages that will be upgraded: inetutils-telnet libc-bin libc-dev-bin libc-devtools libc6 libc6-dev libpng16-16t64 locales mysql-client-core-8.0 telnet
49:2026-02-04 06:04:28,538 INFO All upgrades installed

升级了几个底层库，这时候意识到，应该就是这里引发了上层重启，继续：

~# sudo sed -n '1,200p' /var/log/unattended-upgrades/unattended-upgrades-dpkg.log

Restarting services...
systemctl restart aegis.service aliyun.service chrony.service containerd.service cron.service fwupd.service ilogtaild.service multipathd.service mysql.server.service nginx.service php8.3-fpm.service polkit.service redis-server.service rsyslog.service ssh.service systemd-journald.service systemd-networkd.service systemd-resolved.service systemd-udevd.service tuned.service udisks2.service uuidd.service
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xeu nginx.service" for details.

好家伙，重启了这么多服务。

这次事故里，“触发器”是自动升级导致的服务重启；“根因”是 Nginx 配置模式的脆弱性，upstream 配置在启动期就要求域名可解析，DNS/解析链路只要在重启瞬间抖动一下，Nginx 就起不来，于是任何一次重启（自动升级、人工重启、机器迁移、网络切换）都可能把隐患放大成故障。

本文链接地址：https://dorole.com/2302/

来自：Dorole's Blog

日	一	二	三	四	五	六
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

记一次Nginx异常重启失败

发布者

Steve

发表回复取消回复

发布者

Steve

发表回复 取消回复

发表回复取消回复