Cannot SSH to EC2 Instance: Disk Full

In this tutorial, we are going to learn how to tackle a situation where we cannot SSH to an EC2 instance because its disk is full. This usually happens when there is high I/O activity, or when the /var directory fills up, especially when /var sits on its own Logical Volume.

Create Rescue Instance

The first step is to create the rescue instance. We will use this instance to inspect the volume of the faulty instance.

A very important note to remember: The rescue instance has to be in the same availability zone as the faulty instance, since EBS volumes do not span availability zones.

We will also choose the same AMI as the faulty instance. It is not mandatory, but we want to avoid any major differences between the working and the non-working instance, to make it easier to troubleshoot the issue.

Let’s set the Name tag to “rescue-instance”.

As a good security practice, we will allow SSH only from our own IP address to prevent any unwanted guests.
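If you prefer the command line, the rescue instance can also be launched with the AWS CLI. The snippet below is only a sketch: the AMI, key pair, security group and subnet IDs are placeholders, and the subnet you pick determines the availability zone, so choose one in the same zone as the faulty instance.

# Launch the rescue instance (all IDs below are placeholders)
$ aws ec2 run-instances \
    --image-id ami-0abcdef1234567890 \
    --instance-type t2.micro \
    --key-name my-key-pair \
    --security-group-ids sg-0abcdef1234567890 \
    --subnet-id subnet-0abcdef1234567890 \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=rescue-instance}]'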

Once we have the rescue instance ready, we are good to go to the second step.

Detach and Attach Root Volume

The first thing to do here is to stop the faulty instance. This will allow us to detach its root volume and then attach it to the rescue instance we created in step 1.

On the EC2 console, select the instance that you stopped, then click on the Storage tab. From there we can follow the link to the root volume and detach it.

We will then detach the volume and attach it to the rescue instance. Note that both instances must be in the same availability zone.
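For reference, the same sequence can be scripted with the AWS CLI. The sketch below uses placeholder instance and volume IDs; /dev/sdf is the device name we request for the attachment, which the Amazon Linux kernel will expose as /dev/xvdf.

# Stop the faulty instance so its root volume can be detached
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Detach the root volume from the (now stopped) faulty instance
$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0

# Attach it to the rescue instance as a secondary device
$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0abcdef1234567890 --device /dev/sdf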

Mount EBS Volume

Once the EBS volume is attached to the rescue instance, we will log in to the new instance and try to figure out what is happening.
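The connection itself is a normal SSH session; the key file and public IP below are placeholders for your own values.

$ ssh -i rescue-key.pem ec2-user@<rescue-instance-public-ip>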

Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added (ECDSA) to the list of known hosts.

       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-2/
[ec2-user@ip-172-31-9-120 ~]$

Using the lsblk command, we can see the faulty instance’s volume attached as xvdf:

[ec2-user@ip-172-31-9-120 ~]$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   8G  0 disk
└─xvda1 202:1    0   8G  0 part /
xvdf    202:80   0   8G  0 disk
└─xvdf1 202:81   0   8G  0 part
[ec2-user@ip-172-31-9-120 ~]$

The df -hT command lists all the mounted filesystems, along with their types.

[ec2-user@ip-172-31-9-120 ~]$ df -hT
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  484M     0  484M   0% /dev
tmpfs          tmpfs     492M     0  492M   0% /dev/shm
tmpfs          tmpfs     492M  408K  491M   1% /run
tmpfs          tmpfs     492M     0  492M   0% /sys/fs/cgroup
/dev/xvda1     xfs       8.0G  1.5G  6.6G  19% /
tmpfs          tmpfs      99M     0   99M   0% /run/user/1000
[ec2-user@ip-172-31-9-120 ~]$

We haven’t mounted ours yet, so let’s do that now. Note that we use the nouuid option: since the rescue instance was launched from the same AMI, both root filesystems carry the same XFS UUID, and the mount would otherwise fail with a duplicate UUID error.

[ec2-user@ip-172-31-9-120 ~]$ sudo mkdir /rescue
[ec2-user@ip-172-31-9-120 ~]$ sudo mount -o nouuid /dev/xvdf1 /rescue

[ec2-user@ip-172-31-9-120 ~]$ df -hT
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  484M     0  484M   0% /dev
tmpfs          tmpfs     492M     0  492M   0% /dev/shm
tmpfs          tmpfs     492M  408K  491M   1% /run
tmpfs          tmpfs     492M     0  492M   0% /sys/fs/cgroup
/dev/xvda1     xfs       8.0G  1.5G  6.6G  19% /
tmpfs          tmpfs      99M     0   99M   0% /run/user/1000
/dev/xvdf1     xfs       8.0G  1.5G  6.6G  19% /rescue
[ec2-user@ip-172-31-9-120 ~]$

To confirm that the volume is now mounted, we can list its contents:

[ec2-user@ip-172-31-9-120 ~]$ cd /rescue
[ec2-user@ip-172-31-9-120 rescue]$ ll
total 16
lrwxrwxrwx  1 root root    7 Nov  8 20:51 bin -> usr/bin
dr-xr-xr-x  4 root root 4096 Nov  8 20:52 boot
drwxr-xr-x  3 root root  136 Nov  8 20:52 dev
drwxr-xr-x 80 root root 8192 Nov 11 23:12 etc
drwxr-xr-x  3 root root   22 Nov 11 23:12 home
lrwxrwxrwx  1 root root    7 Nov  8 20:51 lib -> usr/lib
lrwxrwxrwx  1 root root    9 Nov  8 20:51 lib64 -> usr/lib64
drwxr-xr-x  2 root root    6 Nov  8 20:51 local
drwxr-xr-x  2 root root    6 Apr  9  2019 media
drwxr-xr-x  2 root root    6 Apr  9  2019 mnt
drwxr-xr-x  4 root root   27 Nov  8 20:52 opt
drwxr-xr-x  2 root root    6 Nov  8 20:51 proc
dr-xr-x---  3 root root  103 Nov 11 23:12 root
drwxr-xr-x  2 root root    6 Nov  8 20:52 run
lrwxrwxrwx  1 root root    8 Nov  8 20:51 sbin -> usr/sbin
drwxr-xr-x  2 root root    6 Apr  9  2019 srv
drwxr-xr-x  2 root root    6 Nov  8 20:51 sys
drwxrwxrwt  7 root root   93 Nov 11 23:33 tmp
drwxr-xr-x 13 root root  155 Nov  8 20:51 usr
drwxr-xr-x 19 root root  269 Nov 11 23:11 var
[ec2-user@ip-172-31-9-120 rescue]$

Perfect! It is now time to check what is consuming all that space. Since my “faulty” instance was not actually full, I cannot show you a real failure, but I can share the commands I would use to troubleshoot the issue.

How to Find the Biggest Files and Directories in Linux

[ec2-user@ip-172-31-9-120 rescue]$ sudo du -ah /rescue | sort -n -r | head -n 10
1012K	/rescue/usr/share/vim/vim81/ftplugin
1008K	/rescue/usr/lib/python2.7/site-packages/pkg_resources
1008K	/rescue/usr/lib/python2.7/site-packages/cloudinit/config
1004K	/rescue/usr/share/man/man7
1004K	/rescue/usr/lib/modules/4.14.252-195.483.amzn2.x86_64/kernel/net/sunrpc
1004K	/rescue/usr/bin/grub2-mount
1000K	/rescue/usr/lib/python2.7/site-packages/botocore/data/ec2/2016-09-15
996K	/rescue/usr/lib/modules/4.14.252-195.483.amzn2.x86_64/kernel/net/netfilter/ipset
988K	/rescue/usr/lib/python2.7/site-packages/oauthlib
984K	/rescue/usr/lib/systemd/system
[ec2-user@ip-172-31-9-120 rescue]$

Note that sort -n compares only the leading numbers and ignores the K/M/G suffixes, so for a correctly ranked, human-readable output try the following (the -- variant simply protects against file names starting with a dash):

$ cd /path/to/some/where
$ du -hsx * | sort -rh | head -10
$ du -hsx -- * | sort -rh | head -10

And to list only the largest individual files:

# find -type f -exec du -Sh {} + | sort -rh | head -n 5

Final Thoughts

At this stage you are free to do whatever you want with these files. You can remove the offending files and directories, take a backup first, or even move things around to free some space.

Once you have made your changes, unmount the volume, detach it from the rescue instance, and reattach it to the original instance as its root device before starting it again.
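In command form, this last step might look like the sketch below (placeholder IDs again; the root volume of an Amazon Linux 2 instance is usually attached as /dev/xvda, but double-check the device name the faulty instance originally reported).

# On the rescue instance: unmount the volume cleanly first
$ sudo umount /rescue

# From your workstation: detach the volume from the rescue instance
$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0

# Reattach it to the original instance as its root device
$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/xvda

# Start the original instance again
$ aws ec2 start-instances --instance-ids i-0123456789abcdef0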

Don’t forget to delete your testing instance to avoid incurring unnecessary costs!


