File sizes and disk usage
Published on May 06, 2020
Table of contents
- MTP, FUSE, and filesystem syscalls
- ext4 and sparse files
- fdisk: units, sector sizes and I/O sizes
- NVME device drivers
- stat: block sizes and fundamental block sizes
- fdisk attempt number two
- Summary
I was backing up my old phone, running Android, so I could flash a new OS. I deleted most of the stuff via the phone itself, but then I wanted to back up some files.
MTP, FUSE, and filesystem syscalls
I connected it via USB to the laptop, and used jmtpfs
to mount the thing:
root@laptop:/# jmtpfs /mnt/external
Device 0 (VID=22b8 and PID=2e82) is a Motorola Moto G (ID2).
Android device detected, assigning default bug flags
root@laptop:/# ls -lah /mnt/external/
total 4.0K
drwxr-xr-x 3 root root 0 Jan 1 1970 .
drwxr-xr-x 6 root root 4.0K Jun 16 2017 ..
drwxr-xr-x 16 root root 0 Jul 19 2018 'Internal storage'
Sidenote: 1 Jan 1970, nice.
Then I wanted to know how much space I would need on my laptop to back up that
directory. Usually I du -sh .
, but I had a vague memory of that not working
in these partitions. I tried it anyway:
root@laptop:/m/e/Internal storage# du -sh .
0 .
Why?
I know that I have files in here that take up disk space, but even du-ing them directly doesn't work:
root@laptop:/m/e/Internal storage# ls -lh Download/.com.google.Chrome.X6yFrG
-rw-r--r-- 1 root root 125K Jul 3 2016 Download/.com.google.Chrome.X6yFrG
root@laptop:/m/e/Internal storage# du -h Download/.com.google.Chrome.X6yFrG
0 Download/.com.google.Chrome.X6yFrG
I wonder if du
uses a different syscall? Or maybe it looks at other
properties? Let’s see what happens with strace.
root@laptop:/m/e/Internal storage# strace du -h Download/.com.google.Chrome.X6yFrG > /dev/null
execve("/usr/bin/du", ["du", "-h", "Download/.com.google.Chrome.X6yF"...], 0x7fff26d27750 /* 22 vars */) = 0
brk(NULL) = 0x5611069c6000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=112615, ...}) = 0
mmap(NULL, 112615, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7cd07db000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 o\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1831600, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7cd07d9000
mmap(NULL, 1844568, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f7cd0616000
mmap(0x7f7cd063b000, 1351680, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f7cd063b000
mmap(0x7f7cd0785000, 303104, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16f000) = 0x7f7cd0785000
mmap(0x7f7cd07cf000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x7f7cd07cf000
mmap(0x7f7cd07d5000, 13656, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f7cd07d5000
close(3) = 0
arch_prctl(ARCH_SET_FS, 0x7f7cd07da580) = 0
mprotect(0x7f7cd07cf000, 12288, PROT_READ) = 0
mprotect(0x5611055bd000, 4096, PROT_READ) = 0
mprotect(0x7f7cd081f000, 4096, PROT_READ) = 0
munmap(0x7f7cd07db000, 112615) = 0
brk(NULL) = 0x5611069c6000
brk(0x5611069e7000) = 0x5611069e7000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=3044032, ...}) = 0
mmap(NULL, 3044032, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7cd032e000
close(3) = 0
newfstatat(AT_FDCWD, "Download/.com.google.Chrome.X6yFrG", {st_mode=S_IFREG|0644, st_size=127410, ...}, AT_SYMLINK_NOFOLLOW) = 0
fstat(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}) = 0
ioctl(1, TCGETS, 0x7ffef972ddd0) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "0\tDownload/.com.google.Chrome.X6"..., 37) = 37
close(1) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
So du uses newfstatat
, which, according to man stat
:
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
What does ls
use?
root@laptop:/m/e/Internal storage# strace ls -lh Download/.com.google.Chrome.X6yFrG 2>&1 > /dev/null | wc -l
193
Uff, this is a bit long. Let me filter it down:
root@laptop:/m/e/Internal storage# strace ls -lh Download/.com.google.Chrome.X6yFrG 2>&1 > /dev/null | grep Download/
execve("/bin/ls", ["ls", "-lh", "Download/.com.google.Chrome.X6yF"...], 0x7fff2c1c8b90 /* 22 vars */) = 0
lstat("Download/.com.google.Chrome.X6yFrG", {st_mode=S_IFREG|0644, st_size=127410, ...}) = 0
lgetxattr("Download/.com.google.Chrome.X6yFrG", "security.selinux", 0x5630117d9f60, 255) = -1 EOPNOTSUPP (Operation not supported)
getxattr("Download/.com.google.Chrome.X6yFrG", "system.posix_acl_access", NULL, 0) = -1 EOPNOTSUPP (Operation not supported)
Seems to use lstat
. Hm, let’s take a look at them side by side:
newfstatat(AT_FDCWD, "Download/.com.google.Chrome.X6yFrG", {st_mode=S_IFREG|0644, st_size=127410, ...}, AT_SYMLINK_NOFOLLOW) = 0
lstat("Download/.com.google.Chrome.X6yFrG", {st_mode=S_IFREG|0644, st_size=127410, ...}) = 0
They both have st_size=127410
, so it doesn’t seem to be a difference in
syscall behavior. du
may be looking at another property of that structure. My
hypothesis is that it has something to do with blocks vs bytes. I’d like to
print the full structure filled by that syscall, does strace have an option for
that? Maybe stat
prints something useful?
root@laptop:/m/e/Internal storage# stat Download/.com.google.Chrome.X6yFrG
File: Download/.com.google.Chrome.X6yFrG
Size: 127410 Blocks: 0 IO Block: 4096 regular file
Device: 34h/52d Inode: 31 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 1970-01-01 01:00:00.000000000 +0100
Modify: 2016-07-03 16:33:05.000000000 +0100
Change: 1970-01-01 01:00:00.000000000 +0100
Birth: -
Ah, Size=127410
and Blocks=0
. This could be it. Let me try with a non /mnt/external
file.
root@laptop:/m/e/Internal storage# ls -l ~/.bashrc
-rw-r--r-- 1 root root 646 Aug 1 2016 /root/.bashrc
root@laptop:/m/e/Internal storage# du -h ~/.bashrc
4.0K /root/.bashrc
root@laptop:/m/e/Internal storage# stat ~/.bashrc
File: /root/.bashrc
Size: 646 Blocks: 8 IO Block: 4096 regular file
Device: fe01h/65025d Inode: 3670019 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-05-06 14:47:33.290946241 +0100
Modify: 2016-08-01 16:51:58.622367982 +0100
Change: 2016-08-01 16:51:58.622367982 +0100
Birth: -
So we’ve got 646
from ls
, which matches stat’s Size
. du
prints 4.0K
,
which.. could be 512*8
, I guess. Let me read man du
.
Display values are in units of the first available SIZE from --block-size, and the DU_BLOCK_SIZE, BLOCK_SIZE and BLOCKSIZE environment variables. Otherwise, units default to 1024 bytes (or 512 if POSIXLY_CORRECT is set).
The SIZE argument is an integer and optional unit (example: 10K is 10*1024). Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,… (powers of 1000).
Meh, this is confusing. Let me check the source of du
. GNU coreutils has a GitHub mirror:
https://github.com/coreutils/coreutils/blob/master/src/du.c
1kloc, not bad. I wish I could look at a musl equivalent of coreutils.
/* If true, rather than using the disk usage of each file,
use the apparent size (a la stat.st_size). */
static bool apparent_size = false;
I saw something related to “apparent size” in man du
, so I’m guessing that if I
use that flag I’ll probably get du
to print the same as ls
, but I want to
figure out what it prints by default first. Where is this boolean used?
duinfo_set (&dui,
(apparent_size
? MAX (0, sb->st_size)
: (uintmax_t) ST_NBLOCKS (*sb) * ST_NBLOCKSIZE),
(time_type == time_mtime ? get_stat_mtime (sb)
: time_type == time_atime ? get_stat_atime (sb)
: get_stat_ctime (sb)));
OK, so we have st_size
for apparent size, and ST_NBLOCKS(*sb) *
ST_NBLOCKSIZE
by default. Googling for ST_NBLOCKS
got me a stackoverflow Q&A:
https://unix.stackexchange.com/questions/521151/why-is-st-blocks-always-reported-in-512-byte-blocks
Why is st_blocks always reported in 512-byte blocks? I was debugging a fuse filesystem that was reporting wrong sizes for du. It turned out that it was putting st_size / st_blksize [*] into st_blocks of the stat structure. The Linux manual page for stat(2) says:
This seems to be a common problem.
The size of a block is implementation-specific. On Linux it’s always 512 bytes, for historical reasons; in particular, it used to be the typical size of a disk sector.
So this seems to match what I thought, it’s 8*512
. Out of curiosity, let me
try to find ST_NBLOCKSIZE
.
Found something similar:
hugopeixoto@laptop:~$ ack S_BLKSIZE /usr/include/
/usr/include/x86_64-linux-gnu/sys/stat.h
199:# define S_BLKSIZE 512 /* Block size for `st_blocks'. */
From the internets, ST_NBLOCKSIZE
seems to be #define
d as S_BLKSIZE
, so
this is probably it.
https://github.com/rofl0r/gnulib/blob/master/lib/stat-size.h#L105
And ST_NBLOCKS is defined as:
# define ST_NBLOCKS(statbuf) \
(S_ISREG ((statbuf).st_mode) || S_ISDIR ((statbuf).st_mode) \
? (statbuf).st_blocks * ST_BLKSIZE (statbuf) / ST_NBLOCKSIZE : 0)
And ST_BLKSIZE (not to be confused with S_BLKSIZE) is, most of the time, st_blksize. Poking at stat-size.h a bit more, though, the definition quoted above seems to be a compatibility branch for other systems; on Linux, where st_blocks is available, ST_NBLOCKS(statbuf) is plain (statbuf).st_blocks, and ST_NBLOCKSIZE is 512. So du calculates disk usage as:
ST_NBLOCKS(*sb) * ST_NBLOCKSIZE =
sb->st_blocks * S_BLKSIZE =
sb->st_blocks * 512
That matches the earlier numbers: 8 * 512 = 4096, the 4.0K that du printed for .bashrc.
So through a bunch of indirections, this ends up returning file system information directly.
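Ruby's File::Stat exposes the same struct stat fields, so the two quantities du chooses between can be compared directly. A sketch, not du's code; the 646 bytes just echo the .bashrc example:

```ruby
require "tempfile"

# Compare the "apparent size" (st_size) with the disk usage derived from
# st_blocks. Linux counts st_blocks in 512-byte units, whatever st_blksize is.
f = Tempfile.new("du-example")
f.write("x" * 646)  # same apparent size as the .bashrc above
f.flush

st = File.stat(f.path)
puts "apparent size: #{st.size} bytes"         # => 646
puts "disk usage:    #{st.blocks * 512} bytes" # usually 4096: one 4 KiB block
```

The exact block count depends on the filesystem, but the apparent size always matches what ls prints.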
Let me try the POSIXLY_CORRECT
thing:
hugopeixoto@laptop:~$ du history/bash
4992 history/bash
hugopeixoto@laptop:~$ POSIXLY_CORRECT=1 du history/bash
9984 history/bash
OK, that checks out. The displayed number doubles because the display unit becomes 512 bytes instead of 1024.
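Both runs describe the same allocation; only the display unit changes:

```ruby
# du's display unit is 1024 bytes by default, 512 with POSIXLY_CORRECT set.
# Both outputs above describe the same number of bytes:
st_blocks = 9984          # 512-byte blocks, the POSIXLY_CORRECT figure
bytes = st_blocks * 512

puts bytes / 1024         # => 4992, the default du output
puts bytes / 512          # => 9984, the POSIXLY_CORRECT output
```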
Back to the apparent size thing:
root@laptop:/m/e/Internal storage# du -h --apparent-size Download/.com.google.Chrome.X6yFrG
125K Download/.com.google.Chrome.X6yFrG
root@laptop:/m/e/Internal storage# du --apparent-size -sh .
805M .
Seems OK. Now, why does stat
return Blocks=0
? This should be related to jmtpfs
. man jmtpfs
:
jmtpfs is a FUSE and libmtp based filesystem for accessing MTP (Media Transfer Protocol) devices. It was specifically designed for exchanging files between Linux (and Mac OS X) systems and newer Android devices that support MTP but not USB Mass Storage.
So, my current guess is that libmtp
does not return filesystem related stuff
and the FUSE part defaults to zero. Checking this by downloading the source for
jmtpfs
and grepping for st_blocks
.
hugopeixoto@laptop:~/w/contrib$ apt source jmtpfs
Reading package lists... Done
NOTICE: 'jmtpfs' packaging is maintained in the 'Git' version control system at:
git://anonscm.debian.org/collab-maint/jmtpfs.git
Please use:
git clone git://anonscm.debian.org/collab-maint/jmtpfs.git
to retrieve the latest (possibly unreleased) updates to the package.
Need to get 148 kB of source archives.
Get:1 http://ftp.pt.debian.org/debian testing/main jmtpfs 0.5-2 (dsc) [1,874 B]
Get:2 http://ftp.pt.debian.org/debian testing/main jmtpfs 0.5-2 (tar) [143 kB]
Get:3 http://ftp.pt.debian.org/debian testing/main jmtpfs 0.5-2 (diff) [3,321 B]
Fetched 148 kB in 0s (473 kB/s)
dpkg-source: info: extracting jmtpfs in jmtpfs-0.5
dpkg-source: info: unpacking jmtpfs_0.5.orig.tar.gz
dpkg-source: info: unpacking jmtpfs_0.5-2.debian.tar.gz
hugopeixoto@laptop:~/w/contrib$ ack st_nblocks jmtpfs-0.5/src/
hugopeixoto@laptop:~/w/contrib$
No luck on the first try. I'll browse the source a bit. Oh, maybe I didn't find any matches because it isn't set at all.
$ ack block jmtpfs-0.5/
jmtpfs-0.5/src/MtpNode.cpp
131: stat->f_bsize = 512; // We have to pick some block size, so why not 512?
132: stat->f_blocks = storageInfo.maxCapacity / stat->f_bsize;
jmtpfs-0.5/src/MtpRoot.cpp
111: stat->f_bsize = 512; // We have to pick some block size, so why not 512?
112: stat->f_blocks = totalSize / stat->f_bsize;
Hm, f_blocks
? What’s the difference between an f_*
and a st_*
?
So, this uses struct statvfs
. I guess that's what FUSE should use. But this sets the filesystem-wide f_bsize to 512 and derives f_blocks from the total capacity; that's not the per-file st_blocks, so something is up.
hugopeixoto@laptop:~/w/c/j/src$ ack st_size
MtpFile.cpp
55: info.st_size = localFile->getSize();
58: info.st_size = md.self.filesize;
103: if (info.st_size == length)
OK, so we do have a st_size
being set. And no st_blocks
. What’s MTP anyway?
Does it report blocks? (probably not)
https://en.wikipedia.org/wiki/Media_Transfer_Protocol
The Media Transfer Protocol (MTP) is an extension to the Picture Transfer Protocol (PTP) communications protocol that allows media files to be transferred atomically to and from portable devices.
…
MTP is a key part of WMDRM10-PD,[1] a digital rights management (DRM) service for the Windows Media platform.
…
In 2011, it became the standard method to transfer files from/to Android
…
The USB Implementers Forum device working group standardised MTP as a full-fledged Universal Serial Bus (USB) device class in May 2008.[4] Since then MTP is an official extension to PTP and shares the same class code.[5]
Cool?
Comparison with USB Mass Storage
Ah, this sounds promising.
File oriented instead of block oriented protocol By not exposing the filesystem and metadata index, the integrity of these is in full control of the device.
Well, there you go.
None of my devices support USB Mass Storage. And from this SO answer:
https://android.stackexchange.com/questions/190138/how-to-use-usb-mass-storage-mode-on-android-4-3
If device implementations have a USB port with USB peripheral mode support, they:
MAY use USB mass storage, but SHOULD use Media Transfer Protocol to satisfy this requirement.
It seems like this is by design.
Now, back to strace
. Is there a way to print the full structure? man
strace
, --no-abbrev
seems like a good candidate:
--no-abbrev
Print unabbreviated versions of environment, stat, termios, etc. calls. These
structures are very common in calls and so the default behavior displays a
reasonable subset of structure members. Use this option to get all of the gory
details.
Let’s give it a go:
root@laptop:/m/e/Internal storage# strace --no-abbrev -- du -h Download/.com.google.Chrome.X6yFrG 2>&1 > /dev/null | grep newfstata
newfstatat(AT_FDCWD, "Download/.com.google.Chrome.X6yFrG", {
st_dev=makedev(0, 0x34),
st_ino=4,
st_mode=S_IFREG|0644,
st_nlink=1,
st_uid=0,
st_gid=0,
st_blksize=4096,
st_blocks=0,
st_size=127410,
st_atime=0,
st_atime_nsec=0,
st_mtime=1467559985 /* 2016-07-03T16:33:05+0100 */,
st_mtime_nsec=0,
st_ctime=0,
st_ctime_nsec=0
}, AT_SYMLINK_NOFOLLOW) = 0
There it is (indented for readability). strace -v seems to do the same. Not only are the blocks zero, but some timestamp-related fields are also zero. Here's the same for a non-MTP file:
root@laptop:/m/e/Internal storage# strace -v -- du -h ~/.bashrc 2>&1 > /dev/null | grep newfstata
newfstatat(AT_FDCWD, "/root/.bashrc", {
st_dev=makedev(0xfe, 0x1),
st_ino=3670019,
st_mode=S_IFREG|0644,
st_nlink=1,
st_uid=0,
st_gid=0,
st_blksize=4096,
st_blocks=8,
st_size=646,
st_atime=1588772853 /* 2020-05-06T14:47:33.290946241+0100 */,
st_atime_nsec=290946241,
st_mtime=1470066718 /* 2016-08-01T16:51:58.622367982+0100 */,
st_mtime_nsec=622367982,
st_ctime=1470066718 /* 2016-08-01T16:51:58.622367982+0100 */,
st_ctime_nsec=622367982
}, AT_SYMLINK_NOFOLLOW) = 0
Curious that there's an st_blksize
that is not related to st_blocks
. As
explained in one of the above stackoverflow links:
It indicates the preferred size for I/O, i.e. the amount of data that should be transferred in one operation for optimal results (ignoring other layers in the I/O stack).
I wonder what the purpose of st_blocks
is. It can’t be just st_size / 512
.
That would be kind of pointless, right?
Found a thread from 1992, but it doesn’t help much.
https://groups.google.com/forum/#!topic/comp.unix.programmer/7saTJ9gRBEM
This could be related to sparse files.
ext4 and sparse files
Let me look that up and try some cli shenanigans. From the Arch wiki:
https://wiki.archlinux.org/index.php/sparse_file
Creating sparse files The truncate utility can create sparse files. This command creates a 512 MiB sparse file:
$ truncate -s 512M file.img
hugopeixoto@laptop:~/w/c/j/src$ truncate -s 512M file.img
hugopeixoto@laptop:~/w/c/j/src$ du -sh --apparent-size file.img
512M file.img
hugopeixoto@laptop:~/w/c/j/src$ du -sh file.img
0 file.img
hugopeixoto@laptop:~/w/c/j/src$ stat file.img
File: file.img
Size: 536870912 Blocks: 0 IO Block: 4096 regular file
Device: fe01h/65025d Inode: 2904608 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/hugopeixoto) Gid: ( 1000/hugopeixoto)
Access: 2020-05-06 17:11:22.171650277 +0100
Modify: 2020-05-06 17:11:22.171650277 +0100
Change: 2020-05-06 17:11:22.171650277 +0100
Birth: -
Hm, so the size is indeed different from Blocks * 512
. This feels like
filesystem specific stuff. I can now either:
- look through ext4’s source code
- or look for a tool that prints fs data and metadata
I have the file’s inode thanks to stat
: 2904608. I’m using ext4, so I can use
the debugfs
tool, from apt install e2fsprogs. This allows me to run different
. This allows me to run different
debug commands.
root@laptop:/# debugfs -R 'imap <2904608>' /dev/mapper/laptop--vg-root
debugfs 1.45.6 (20-Mar-2020)
Inode 2904608 is part of block group 354
located at block 11535681, offset 0x0f00
Can I read this block? I need the block size, to know where to seek, and the inode size, to know how much to read.
root@laptop:/# debugfs -R 'stats' /dev/mapper/laptop--vg-root | grep size | head -n9
debugfs 1.45.6 (20-Mar-2020)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Block size: 4096
Fragment size: 4096
Flex block group size: 16
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
So.. I should seek to 11535681 * 4096 + 0x0f00 and read 256 bytes? 0x0f00 is 15 * 256 = 3840, so that's inside the 4096 limit. It's the last inode in this block, apparently. It's kind of a large number, though. Tried reading this with dd, didn't work.
root@laptop:/# dd if=/dev/mapper/laptop--vg-root bs=1 seek=47250153216 count=256
dd: 'standard output': cannot seek: Illegal seek
0+0 records in
0+0 records out
0 bytes copied, 0.000325579 s, 0.0 kB/s
I hope I don’t accidentally my main partition. This could have to do with the fact that I’m using an LVM encrypted volume. Maybe seeks are not allowed.
Crap. I should use skip
when reading, not seek
. seek
is for writing. I
hope I didn’t jumble anything.
root@laptop:/# dd if=/dev/mapper/laptop--vg-root bs=1 skip=47250153216 count=256 | xxd
256+0 records in
256+0 records out
256 bytes copied, 0.00061617 s, 415 kB/s
00000000: a481 e803 0000 0020 aae1 b25e aae1 b25e ....... ...^...^
00000010: aae1 b25e 0000 0000 e803 0100 0000 0000 ...^............
00000020: 0000 0800 0100 0000 0af3 0000 0400 0000 ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000060: 0000 0000 64e2 2b6f 0000 0000 0000 0000 ....d.+o........
00000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000080: 2000 0000 94b3 ec28 94b3 ec28 94b3 ec28 ......(...(...(
00000090: aae1 b25e 94b3 ec28 0000 0000 0000 0000 ...^...(........
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
So this is my inode. Cool. Let me try to write a single byte at pos=0 in the
sparse file and try again (checking if the inode changes before running dd
again).
hugopeixoto@laptop:~/w/c/j/src$ echo -n "*" | dd of=file.img bs=1 seek=0 count=1
1+0 records in
1+0 records out
1 byte copied, 0.000335263 s, 3.0 kB/s
hugopeixoto@laptop:~/w/c/j/src$ stat file.img | grep Inode
Device: fe01h/65025d Inode: 2904608 Links: 1
root@laptop:/# dd if=/dev/mapper/laptop--vg-root bs=1 skip=47250153216 count=256 | xxd
256+0 records in
256+0 records out
256 bytes copied, 0.00263862 s, 97.0 kB/s
00000000: a481 e803 0100 0000 aae1 b25e e7f9 b25e ...........^...^
00000010: e7f9 b25e 0000 0000 e803 0100 0800 0000 ...^............
00000020: 0000 0800 0100 0000 0af3 0100 0400 0000 ................
00000030: 0000 0000 0000 0000 0100 0000 bc6d 5a00 .............mZ.
00000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000060: 0000 0000 64e2 2b6f 0000 0000 0000 0000 ....d.+o........
00000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000080: 2000 0000 00b1 c3df 00b1 c3df 94b3 ec28 ..............(
00000090: aae1 b25e 94b3 ec28 0000 0000 0000 0000 ...^...(........
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
OK, inode stayed the same. I don’t see the *
in here, which makes sense. File
contents are not inlined.
What’s the difference between these two? I need to know what these fields mean first.
https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Inode_Table
Let’s see if I can decode the timestamps.
Access time, offset 0x8, size=0x4
hugopeixoto@laptop:~$ sudo dd status=none if=/dev/mapper/laptop--vg-root bs=1 skip=47250153216 count=256 |
> ruby -e "puts Time.at(STDIN.read[0x8...(0x8+4)].unpack('V')[0])"
2020-05-06 17:11:22 +0100
Sidenote: https://ruby-doc.org/core-2.7.1/String.html#method-i-unpack
Looks sane. OK, now that I know I’m reading the right thing, let’s see what else is in this table.
| 0x1C | __le32 | i_blocks_lo | Lower 32-bits of “block” count. If the huge_file feature flag is not set on the filesystem, the file consumes i_blocks_lo 512-byte blocks on disk.
hugopeixoto@laptop:~$ sudo dd status=none if=/dev/mapper/laptop--vg-root bs=1 skip=47250153216 count=256 |
> ruby -e "puts STDIN.read[0x1C...(0x1C+4)].unpack('V')"
8
So this is currently taking 8 blocks? What does stat
say again?
hugopeixoto@laptop:~/w/c/j/src$ stat file.img | grep Blocks
Size: 1 Blocks: 8 IO Block: 4096 regular file
OK, so stat
reads what’s in this table. The 512 thing is referenced in the
ext4 docs again.
What if I now write a byte to the end of the file? What changes?
Oh no, I just noticed that when I wrote to the first byte, I lost the sparseness of the file:
hugopeixoto@laptop:~/w/c/j/src$ du --apparent-size file.img
1 file.img
That’s fine. The inode exploration still stands. I should have used conv=notrunc
:
hugopeixoto@laptop:~$ echo -n "*" | dd of=file.img bs=1 seek=0 count=1 conv=sparse conv=notrunc
hugopeixoto@laptop:~$ stat file.img
File: file.img
Size: 536870912 Blocks: 8 IO Block: 4096 regular file
Device: fe01h/65025d Inode: 2901111 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/hugopeixoto) Gid: ( 1000/hugopeixoto)
Access: 2020-05-06 20:40:26.700433373 +0100
Modify: 2020-05-06 20:40:29.600453655 +0100
Change: 2020-05-06 20:40:29.600453655 +0100
Birth: -
OK, ready to go. New offsets: inode=2901111, offset=47249257984
Writing last byte:
hugopeixoto@laptop:~$ echo -n "*" | dd of=file.img bs=1 seek=536870911 count=1 conv=notrunc
hugopeixoto@laptop:~$ stat file.img
File: file.img
Size: 536870912 Blocks: 16 IO Block: 4096 regular file
Device: fe01h/65025d Inode: 2901111 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/hugopeixoto) Gid: ( 1000/hugopeixoto)
Access: 2020-05-06 20:40:26.700433373 +0100
Modify: 2020-05-06 20:43:57.649962142 +0100
Change: 2020-05-06 20:43:57.649962142 +0100
Birth: -
hugopeixoto@laptop:~$ du -h --apparent-size file.img
512M file.img
hugopeixoto@laptop:~$ du -h file.img
8.0K file.img
Size stays at 512M, but disk usage is low. How does the file system handle this? Let’s take a look at the inode entry again (since I had to recreate the file anyway):
hugopeixoto@laptop:~$ sudo dd status=none if=/dev/mapper/laptop--vg-root bs=1 skip=47249257984 count=256 | xxd
00000000: a481 e803 0000 0020 aa12 b35e 7d13 b35e ....... ...^}..^
00000010: 7d13 b35e 0000 0000 e803 0100 1000 0000 }..^............
00000020: 0000 0800 0100 0000 0af3 0200 0400 0000 ................
00000030: 0000 0000 0000 0000 0100 0000 0038 ab02 .............8..
00000040: ffff 0100 0100 0000 0000 ad02 0000 0000 ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000060: 0000 0000 fc40 b686 0000 0000 0000 0000 .....@..........
00000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000080: 2000 0000 788a f69a 788a f69a 740f ffa6 ...x...x...t...
00000090: aa12 b35e 740f ffa6 0000 0000 0000 0000 ...^t...........
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
There’s a flags field, let’s see what’s up there. Also, I need an alias to get the block info.
hugopeixoto@laptop:~$ alias getblock="sudo dd status=none if=/dev/mapper/laptop--vg-root bs=1 skip=47249257984 count=256"
hugopeixoto@laptop:~$ getblock | ruby -e "puts STDIN.read[0x20..(0x20+4)].unpack('V')[0].to_s(16)"
80000
0x80000
means that the inode uses extents:
In ext4, the file to logical block map has been replaced with an extent tree. Under the old scheme, allocating a contiguous run of 1,000 blocks requires an indirect block to map all 1,000 entries; with extents, the mapping is reduced to a single struct ext4_extent with ee_len = 1000.
So, what I’m getting from this is that file contents are in blocks, and there’s
a map somewhere to point file block 1 to disk block X, file block 2 to disk
block Y, etc. Before there had to be an entry for every file block. Now, if
contiguous file blocks are in contiguous disk blocks, this can be compressed
using extents. Let’s see where we can find that information for file.img
. Where do we start?
| 0x28 | 60 bytes | i_block[EXT4_N_BLOCKS=15] | Block map or extent tree. See the section “The Contents of inode.i_block”.
hugopeixoto@laptop:~$ getblock | ruby -e "puts STDIN.read[0x28..(0x28+60)]" | xxd
00000000: 0af3 0200 0400 0000 0000 0000 0000 0000 ................
00000010: 0100 0000 0038 ab02 ffff 0100 0100 0000 .....8..........
00000020: 0000 ad02 0000 0000 0000 0000 0000 0000 ................
00000030: 0000 0000 0000 0000 0000 0000 fc0a ..............
There’s a 0af3
, which matches the documentation (considering endianness):
0x0 __le16 eh_magic Magic number, 0xF30A.
Let’s extract the full header:
hugopeixoto@laptop:~$ getblock | ruby -e "puts STDIN.read[0x28..(0x28+60)].unpack('H4vvvV')"
0af3
2
4
0
0
Magic number, check. 2 entries, maximum of 4, depth level 0. The depth level is important:
Depth of this extent node in the extent tree. 0 = this extent node points to data blocks
So, this could be 2 entries pointing to one datablock each. One for the
beginning of the sparse file and one for the end of the sparse file, maybe? If
each data block is 4096 bytes, that would be 8k disk usage. These two entries
point directly to data blocks, so they’re of the type struct ext4_extent
:
| 0x0 | __le32 | ee_block | First file block number that this extent covers.
| 0x4 | __le16 | ee_len | Number of blocks covered by extent. [...]
| 0x6 | __le16 | ee_start_hi | Upper 16-bits of the block number to which this extent points.
| 0x8 | __le32 | ee_start_lo | Lower 32-bits of the block number to which this extent points.
Let’s get their info:
hugopeixoto@laptop:~$ getblock | ruby -e "puts STDIN.read[0x28..(0x28+60)][0xC...(0xC+12)].unpack('VvvV')"
0
1
0
44775424
hugopeixoto@laptop:~$ getblock | ruby -e "puts STDIN.read[0x28..(0x28+60)][0xC+12...(0xC+12+12)].unpack('VvvV')"
131071
1
0
44892160
So the first extent points to the first file block, and it matches the disk block 44775424. The second extent points to the file block 131071, and it matches the disk block 44892160. Cute!
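One detail the unpack glosses over: an extent's target block number is 48 bits wide, split between the two ee_start fields. Combining them is trivial here since ee_start_hi is zero in both extents:

```ruby
# An ext4 extent stores its target block number in 48 bits:
# the low 32 bits in ee_start_lo, the high 16 bits in ee_start_hi.
def extent_start(ee_start_hi, ee_start_lo)
  (ee_start_hi << 32) | ee_start_lo
end

# The two extents unpacked above both had ee_start_hi == 0:
puts extent_start(0, 44775424)  # => 44775424
puts extent_start(0, 44892160)  # => 44892160
```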
Let’s try to read the data blocks:
hugopeixoto@laptop:~$ sudo dd status=none if=/dev/mapper/laptop--vg-root bs=4096 skip=44775424 count=1 | xxd | head -n3
00000000: 2a00 0000 0000 0000 0000 0000 0000 0000 *...............
00000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
hugopeixoto@laptop:~$ sudo dd status=none if=/dev/mapper/laptop--vg-root bs=4096 skip=44892160 count=1 | xxd | tail -n3
00000fd0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000fe0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000ff0: 0000 0000 0000 0000 0000 0000 0000 002a ...............*
There are my asterisks. Notice head vs tail.
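Another way to see the hole, without poking the raw device: lseek supports SEEK_DATA and SEEK_HOLE on filesystems that track holes, and Ruby exposes them as IO::SEEK_DATA / IO::SEEK_HOLE. A sketch mirroring file.img's layout (data at both ends, one big hole between; the exact hole offset is filesystem dependent):

```ruby
require "tempfile"

# Locate data and holes in a sparse file with lseek's SEEK_DATA / SEEK_HOLE.
# (On filesystems that don't track holes, the whole file counts as data.)
f = Tempfile.new("sparse-demo")
size = 512 * 1024 * 1024
f.truncate(size)         # 512 MiB of hole, no blocks allocated yet
f.seek(0)
f.write("*")             # data at the very start
f.seek(size - 1)
f.write("*")             # data at the very end
f.flush

f.seek(0, IO::SEEK_DATA) # first data region starts at offset 0
puts f.pos               # => 0
f.seek(0, IO::SEEK_HOLE) # first hole starts right after the first block
puts f.pos               # e.g. 4096 on ext4
```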
Back to the 512 byte block size. What’s up with that? Apparently, according to the internet, it matches the physical sector size of older hard drives.
https://askubuntu.com/questions/1144535/is-the-default-512-byte-physical-sector-size-appropriate-for-ssd-disks-under-lin
Nowadays disks use 4096 byte blocks, but to keep everything compatible, we still use 512?
hugopeixoto@laptop:~$ stat -f /dev/nvme0
File: "/dev/nvme0"
ID: 0 Namelen: 255 Type: tmpfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 996251 Free: 996251 Available: 996251
Inodes: Total: 996251 Free: 995747
hugopeixoto@laptop:~$ sudo fdisk -l /dev/nvme0n1p3
Disk /dev/nvme0n1p3: 237.75 GiB, 255266390016 bytes, 498567168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
So statting the disk reports a bunch of block sizes = 4096, matching the IO Block that shows up when statting a file. Strictly speaking, though, stat -f /dev/nvme0 reports on the filesystem holding the device node (note the Type: tmpfs in its output), not on the NVMe device itself. fdisking the partition displays 512 all over the place.
I wonder what kind of syscalls, or metadata, all of these come from.
fdisk: units, sector sizes and I/O sizes
root@laptop:~# strace -v fdisk -l /dev/nvme0n1 2>&1 >/dev/null| grep 512
ioctl(3, BLKIOMIN, [512]) = 0
ioctl(3, BLKIOOPT, [512]) = 0
ioctl(3, BLKPBSZGET, [512]) = 0
ioctl(3, BLKSSZGET, [512]) = 0
ioctl(3, BLKSSZGET, [512]) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
lseek(3, 512, SEEK_SET) = 512
read(3, "EFI PART\0\0\1\0\\\0\0\0\300W\215\226\0\0\0\0\1\0\0\0\0\0\0\0"..., 512) = 512
read(3, "EFI PART\0\0\1\0\\\0\0\0\16\372\2731\0\0\0\0\2572\317\35\0\0\0\0"..., 512) = 512
BLKIOMIN and BLKIOOPT are probably I/O size (minimum/optimal); BLKSSZGET and BLKPBSZGET are Sector size (logical/physical). Unsure about that Units: sectors of 1 * 512.
Sidenote: I just learned of strace -s N, which increases the default string size before it gets ellipsized.
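For reference, those request numbers come from include/uapi/linux/fs.h, built with the _IO macro. Here’s a small Python sketch of the same queries; the helper name is mine, and it needs read access to an actual block device, so treat it as an illustration rather than a drop-in tool:

```python
import fcntl
import os
import struct

# _IO(type, nr) from include/uapi/linux/fs.h: for argument-less requests,
# the encoding is just the magic byte shifted left, OR'd with the number.
def _IO(magic, nr):
    return (magic << 8) | nr

BLKSSZGET  = _IO(0x12, 104)  # logical sector size   -> 0x1268
BLKIOMIN   = _IO(0x12, 120)  # minimum I/O size      -> 0x1278
BLKIOOPT   = _IO(0x12, 121)  # optimal I/O size      -> 0x1279
BLKPBSZGET = _IO(0x12, 123)  # physical sector size  -> 0x127b

def block_sizes(device):
    """Ask the kernel the same questions fdisk does (needs read access
    to a real block device, e.g. /dev/nvme0n1)."""
    fd = os.open(device, os.O_RDONLY)
    try:
        return {
            name: struct.unpack("i", fcntl.ioctl(fd, req, struct.pack("i", 0)))[0]
            for name, req in [("logical", BLKSSZGET), ("physical", BLKPBSZGET),
                              ("io_min", BLKIOMIN), ("io_opt", BLKIOOPT)]
        }
    finally:
        os.close(fd)
```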
I’ll check fdisk’s source; with every value being 512 it’s a bit hard to tell where Units comes from.
From a random mirror:
https://github.com/karelzak/util-linux/blob/master/disk-utils/fdisk-list.c#L76
fdisk_info(cxt, _("Units: %s of %d * %ld = %ld bytes"),
           fdisk_get_unit(cxt, FDISK_PLURAL),
           fdisk_get_units_per_sector(cxt),
           fdisk_get_sector_size(cxt),
           fdisk_get_units_per_sector(cxt) * fdisk_get_sector_size(cxt));
fdisk_info(cxt, _("Sector size (logical/physical): %lu bytes / %lu bytes"),
           fdisk_get_sector_size(cxt),
           fdisk_get_physector_size(cxt));
fdisk_info(cxt, _("I/O size (minimum/optimal): %lu bytes / %lu bytes"),
           fdisk_get_minimal_iosize(cxt),
           fdisk_get_optimal_iosize(cxt));
It seems to use sector_size (ioctl BLKSSZGET) in Units as well. Trying to figure out where the actual ioctl is called. Currently going through topology.c and topology/ioctl.c.
Traced it to probe.c:blkid_probe_get_sectorsize, then to lib/blkdev.c:blkdev_get_sector_size, and there it is. Unsure why this one lives somewhere so different from the other three, which are defined in topology/ioctl.c. Saving syscalls, maybe.
NVME device drivers
ioctl calls are usually handled by the device driver. I suppose my SSD requires a device driver, but I have no idea which one it is.
https://unix.stackexchange.com/questions/248494/how-to-find-the-driver-module-associated-with-sata-device-on-linux
Use udevadm info as described in the other answer to the link you mentioned
I know that udev handles devices and hotplug, so I guess it makes sense that it can list the device drivers.
root@laptop:~# udevadm info -a -n /dev/nvme0 | egrep looking\|DRIVER
looking at device '/devices/pci0000:00/0000:00:1d.0/0000:3c:00.0/nvme/nvme0':
DRIVER==""
looking at parent device '/devices/pci0000:00/0000:00:1d.0/0000:3c:00.0':
DRIVERS=="nvme"
looking at parent device '/devices/pci0000:00/0000:00:1d.0':
DRIVERS=="pcieport"
looking at parent device '/devices/pci0000:00':
DRIVERS==""
I suppose that the relevant part is nvme.
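udevadm is walking up the device chain in sysfs and printing each level’s driver symlink. A best-effort Python equivalent of that lookup (the exact sysfs layout varies between kernels and device types, so this is a sketch, not a udevadm replacement):

```python
import os

def block_device_driver(name):
    """Best-effort: find the kernel driver bound to a block device by
    walking up the sysfs hierarchy looking for a 'driver' symlink,
    similar to what `udevadm info -a` prints at each parent level.
    Returns None if no driver is found."""
    dev = os.path.realpath(f"/sys/block/{name}/device")
    while dev != "/":
        driver_link = os.path.join(dev, "driver")
        if os.path.islink(driver_link):
            return os.path.basename(os.readlink(driver_link))
        dev = os.path.dirname(dev)  # climb to the parent device
    return None

# e.g. block_device_driver("nvme0n1") should report "nvme" on a machine
# like the one above; a nonexistent device yields None.
```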
https://github.com/torvalds/linux/tree/master/drivers/nvme
A host and a target directory. What are those? They both seem small. Can I shallowly clone this directory? I don’t want to clone the full kernel tree. Meh, maybe it’s fast. Trying a shallow clone.
hugopeixoto@laptop:~/w/contrib$ git clone https://github.com/torvalds/linux --depth 1 --no-checkout
Cloning into 'linux'...
remote: Enumerating objects: 72113, done.
remote: Counting objects: 100% (72113/72113), done.
remote: Compressing objects: 100% (67592/67592), done.
remote: Total 72113 (delta 5341), reused 24234 (delta 3791), pack-reused 0
Receiving objects: 100% (72113/72113), 190.56 MiB | 6.87 MiB/s, done.
Resolving deltas: 100% (5341/5341), done.
hugopeixoto@laptop:~/w/c/linux$ git checkout master
A minute or so, not bad. Found this:
// in block/ioctl.c:

/*
 * Common commands that are handled the same way on native and compat
 * user space. Note the separate arg/argp parameters that are needed
 * to deal with the compat_ptr() conversion.
 */
static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
                               unsigned cmd, unsigned long arg, void __user *argp)
{
    unsigned int max_sectors;

    switch (cmd) {
    // [...]
    case BLKSSZGET: /* get block device logical block size */
        return put_int(argp, bdev_logical_block_size(bdev));
    // [...]
    }
}
And that leads to:
// in include/linux/blkdev.h

static inline unsigned queue_logical_block_size(const struct request_queue *q)
{
    int retval = 512;

    if (q && q->limits.logical_block_size)
        retval = q->limits.logical_block_size;

    return retval;
}

static inline unsigned int bdev_logical_block_size(struct block_device *bdev)
{
    return queue_logical_block_size(bdev_get_queue(bdev));
}
There’s a hardcoded 512 again. Does anyone set limits.logical_block_size?
// in block/blk-settings.c

void blk_set_default_limits(struct queue_limits *lim)
{
    // [...]
    lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
    // [...]
}

// [...]

void blk_queue_logical_block_size(struct request_queue *q, unsigned int size)
{
    q->limits.logical_block_size = size;

    if (q->limits.physical_block_size < size)
        q->limits.physical_block_size = size;

    if (q->limits.io_min < q->limits.physical_block_size)
        q->limits.io_min = q->limits.physical_block_size;
}

// [...]

int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, sector_t start)
{
    // [...]
    t->logical_block_size = max(t->logical_block_size, b->logical_block_size);
    // [...]
}
Another default, a setter, and something about “stacked drivers like MD and DM”. The NVME driver has a few calls to the setter:
hugopeixoto@laptop:~/w/c/l/d/nvme$ ack blk_queue_logical_block_size
host/multipath.c
383: blk_queue_logical_block_size(q, 512);
host/core.c
1843: blk_queue_logical_block_size(disk->queue, bs);
3594: blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
Let’s see what’s up in core.c.
static void nvme_update_disk_info(struct gendisk *disk, struct nvme_ns *ns,
                                  struct nvme_id_ns *id)
{
    // [...]
    unsigned short bs = 1 << ns->lba_shift;
    // [...]
    if (ns->lba_shift > PAGE_SHIFT) {
        /* unsupported block size, set capacity to 0 later */
        bs = (1 << 9);
    }
    // [...]
    blk_queue_logical_block_size(disk->queue, bs);
    // [...]
}

// [...]

static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
    // [...]
    ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
    blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
    // [...]
}
So, apart from another pair of 512s, it may be the result of reading from ns->lba_shift.
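To make the shift arithmetic concrete, here’s a toy Python model of that computation — not the driver’s code, just the logic, with PAGE_SHIFT assumed to be 12 (4 KiB pages):

```python
PAGE_SHIFT = 12  # 4 KiB pages, the common case on x86-64

def nvme_block_size(lba_shift):
    """Model of the block size picked in nvme_update_disk_info: the LBA
    format stores the data size as a power-of-two shift, and anything
    bigger than the page size falls back to the 512-byte escape hatch."""
    if lba_shift > PAGE_SHIFT:
        return 1 << 9  # unsupported block size
    return 1 << lba_shift

assert nvme_block_size(9) == 512    # the nvme_alloc_ns default
assert nvme_block_size(12) == 4096  # a 4K-native namespace
assert nvme_block_size(13) == 512   # larger than a page: escape hatch
```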
What’s LBA?
https://en.wikipedia.org/wiki/Logical_block_addressing
Logical block addressing (LBA) is a common scheme used for specifying the location of blocks of data stored on computer storage devices, generally secondary storage systems such as hard disk drives.
ns->lba_shift seems to be nvme specific, because ns is a struct nvme_ns *, so I searched only in drivers/nvme. Got this:
static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
{
    struct nvme_ns *ns = disk->private_data;

    /*
     * If identify namespace failed, use default 512 byte block size so
     * block layer can use before failing read/write for 0 capacity.
     */
    ns->lba_shift = id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ds;
    if (ns->lba_shift == 0)
        ns->lba_shift = 9;
There are two setters of lbaf:
// target/admin-cmd.c
id->nlbaf = 0;
id->lbaf[0].ds = ns->blksize_shift;
// host/lightnvm.c
nvme_nvm_set_addr_20(&geo->addrf, &id->lbaf);
lightnvm is a different driver, so I’m ignoring that. target/admin-cmd seems to refer to Admin commands. SSDs have two command queues:
https://metebalci.com/blog/a-quick-tour-of-nvm-express-nvme/
Admin Commands are sent to Admin Submission and Completion Queue (there is only one of this pair with identifier=0).
I/O Commands (called NVM Commands) are sent to I/O Submission and Completion Queues. I/O Command Queues has to be explicitly managed (created / deleted etc.)
So, where does ns->blksize_shift come from?
hugopeixoto@laptop:~/w/c/l/d/nvme$ ack 'blksize_shift ='
target/io-cmd-bdev.c
66: ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
target/io-cmd-file.c
56: ns->blksize_shift = min_t(u8,
78: ns->blksize_shift = 0;
I’m guessing that I care about io-cmd-bdev, since this is not related to a specific file, but to the block device.
Wait. bdev_logical_block_size. I’ve seen that before.
bdev_logical_block_size calls queue_logical_block_size,
- which reads limits.logical_block_size,
- which is set by blk_queue_logical_block_size,
- which is called with ns->lba_shift,
- which is set to id->lbaf[0].ds,
- which is set to ns->blksize_shift,
- which is initialized to bdev_logical_block_size.
I may have missed something. Apart from a bunch of 512 escape hatches, I don’t think this fetches a value from anywhere. I guess it could be manually set, or somehow overridden? I’m not sure.
I’ll move on to stat and come back to this later.
stat: block sizes and fundamental block sizes
root@laptop:/# strace -s 1024 -v stat -f /dev/nvme0 2>&1 >/dev/null | tail -n2
statfs("/dev/nvme0", {
f_type=TMPFS_MAGIC,
f_bsize=4096,
f_blocks=996251,
f_bfree=996251,
f_bavail=996251,
f_files=996251,
f_ffree=995765,
f_fsid={val=[0,
0]},
f_namelen=255,
f_frsize=4096,
f_flags=ST_VALID|ST_NOSUID|ST_NOEXEC|ST_RELATIME
}) = 0
write(1,
" File: \"/dev/nvme0\"\n"
" ID: 0 Namelen: 255 Type: tmpfs\n"
"Block size: 4096 Fundamental block size: 4096\n"
"Blocks: Total: 996251 Free: 996251 Available: 996251\n"
"Inodes: Total: 996251 Free: 995765\n",
219) = 219
This seems to match the information I obtained with debugfs earlier:
root@laptop:/# debugfs -R 'stats' /dev/mapper/laptop--vg-root | grep size | head -n9
debugfs 1.45.6 (20-Mar-2020)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Block size: 4096
Fragment size: 4096
Flex block group size: 16
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Looking at the source of tune2fs, which prints basically the same info as the debugfs stats command, we have:
fprintf(f, "Block size: %u\n", EXT2_BLOCK_SIZE(sb));
if (ext2fs_has_feature_bigalloc(sb))
    fprintf(f, "Cluster size: %u\n",
            EXT2_CLUSTER_SIZE(sb));
else
    fprintf(f, "Fragment size: %u\n",
            EXT2_CLUSTER_SIZE(sb));
And according to the ext4 kernel wiki:
https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Bigalloc
At the moment, the default size of a block is 4KiB, which is a commonly supported page size on most MMU-capable hardware. This is fortunate, as ext4 code is not prepared to handle the case where the block size exceeds the page size. However, for a filesystem of mostly huge files, it is desirable to be able to allocate disk blocks in units of multiple blocks to reduce both fragmentation and metadata overhead. The bigalloc feature provides exactly this ability. The administrator can set a block cluster size at mkfs time (which is stored in the s_log_cluster_size field in the superblock); from then on, the block bitmaps track clusters, not individual blocks.
From the Blocks section:
You may experience mounting problems if block size is greater than page size (i.e. 64KiB blocks on a i386 which only has 4KiB memory pages).
So this explains the difference between blocks and fragments. Ext4 doesn’t support fragments, but it supports clusters, which sound like the opposite, in a way.
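For reference, the superblock stores both sizes as log2 shifts relative to 1024 (s_log_block_size and s_log_cluster_size); a small sketch of how the byte sizes are derived from those fields:

```python
def ext4_block_size(s_log_block_size):
    """The superblock stores log2(block size / 1024), so 2 means 4096."""
    return 1024 << s_log_block_size

def ext4_cluster_size(s_log_cluster_size):
    """With bigalloc, allocation happens in clusters of this many bytes;
    without bigalloc, s_log_cluster_size must equal s_log_block_size,
    which is why debugfs shows 4096/4096 above."""
    return 1024 << s_log_cluster_size

assert ext4_block_size(2) == 4096
assert ext4_cluster_size(2) == 4096
```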
fdisk attempt number two
hugopeixoto@laptop:~$ sudo fdisk -l /dev/nvme0n1p3
Disk /dev/nvme0n1p3: 237.75 GiB, 255266390016 bytes, 498567168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
This is, in retrospect, completely unrelated to whatever du is doing. But let’s investigate a bit anyway.
OK, so when tools are writing files, especially if it’s a lot of data, they need to decide how much to write in each syscall. dd, for example, allows you to specify obs (output_blocksize). Here’s a blog post on this:
http://blog.tdg5.com/tuning-dd-block-size/
Sidetrack: I ended up doing a pull request to fix a small markup bug, yay static sites! https://github.com/tdg5/blog/pull/8
I would assume that it would default to I/O size (optimal), but apparently it defaults to 512. I think this is hardcoded separately, not really bothering to do the BLKIOOPT ioctl.
Since fdisk is reporting 512 here, and I’m pretty sure dd obs=64K would be faster, maybe dd doesn’t bother because the right value is not being reported anyway.
I’ll check my other disks:
root@desktop:~# for disk in /sys/block/{sd*,nvme*}; do
> echo "$disk $(cat "$disk"/queue/{logical_block_size,physical_block_size,minimum_io_size,optimal_io_size} |
> tr $'\n' ' ')";
> done |
> column -te
/sys/block/sda 512 4096 4096 0
/sys/block/sdb 512 4096 4096 0
/sys/block/sdc 512 4096 4096 0
/sys/block/sdd 512 4096 4096 0
/sys/block/nvme0n1 512 512 512 512
So maybe the SSD is indeed faster with 512 blocks? Let me just disprove that.
hugopeixoto@desktop:~$ X=512; dd if=/dev/zero of=zero.img "bs=$X" count=$(( 1024 * 1024 * 1024 / "$X" ))
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.11537 s, 345 MB/s
hugopeixoto@desktop:~$ rm zero.img
hugopeixoto@desktop:~$ X=4096; dd if=/dev/zero of=zero.img "bs=$X" count=$(( 1024 * 1024 * 1024 / "$X" ))
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.658087 s, 1.6 GB/s
hugopeixoto@desktop:~$ rm zero.img
hugopeixoto@desktop:~$ X=65536; dd if=/dev/zero of=zero.img "bs=$X" count=$(( 1024 * 1024 * 1024 / "$X" ))
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.497855 s, 2.2 GB/s
I tried this a couple of times, because sometimes things get cached (and I should use sync), and 512 was always way slower than 4096 or 65536. From 64k up, it stays at 2.2 GB/s, until it starts to go down again at bs=8M.
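In case you want to reproduce the experiment without dd, here’s a rough Python version; the file name and sizes are arbitrary, and the absolute numbers will vary wildly with hardware and caching:

```python
import os
import tempfile
import time

def write_zeros(path, total, bs):
    """Write `total` bytes of zeros to `path` in `bs`-byte chunks, roughly
    what `dd if=/dev/zero of=path bs=$bs count=$((total / bs))` does.
    Returns the elapsed time in seconds."""
    buf = b"\0" * bs
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.monotonic()
    try:
        for _ in range(total // bs):
            os.write(fd, buf)
        os.fsync(fd)  # flush, so the page cache doesn't hide all the cost
    finally:
        os.close(fd)
    return time.monotonic() - start

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "zero.img")
        total = 64 * 1024 * 1024  # 64 MiB keeps the run short
        for bs in (512, 4096, 65536):
            elapsed = write_zeros(path, total, bs)
            assert os.path.getsize(path) == total
            print(f"bs={bs}: {elapsed:.3f}s")
```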
So why is the SSD reporting 512? Maybe it’s to be compatible with older systems, who knows. The SATA ones report 0, and fdisk defaults to 4096.
https://superuser.com/questions/451883/can-i-change-my-ssd-sector-size
The 512B sector size reported by the SSD is only for compatibility purposes. Internally data is stored on 8kiB+ NAND pages. The SSD controller keeps track of the mapping from 512B to pages internally in the FTL (Flash Translation Layer).
The “Sector Size” is a fake number reported by the SSD controller so that legacy SATA controllers will play nicely with your SSD.
OK, so it seems that this is it. I don’t think there’s any way to get the real value for this.
Since this is read by fdisk, I assume it’s important information when creating disk partitions, to make sure they’re aligned. Sounds like an adventure for another day.
Summary
du reports both apparent (byte) size and real (block) size. It takes the filesystem’s reported block count and block size and normalizes it to 512-byte blocks. Then, to decide what to output, it converts that to a block size specified by the user, via a command line argument, the DU_BLOCK_SIZE environment variable, or even POSIXLY_CORRECT.
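A sketch of that conversion (the helper name is mine, not du’s actual code):

```python
import math

def du_output_blocks(st_blocks, output_block_size=1024):
    """st_blocks from stat() counts 512-byte units; du rescales that to
    the user's block size (1024 by default for GNU du), rounding up."""
    return math.ceil(st_blocks * 512 / output_block_size)

# A file occupying eight 512-byte blocks (4 KiB on disk):
assert du_output_blocks(8, 1024) == 4
assert du_output_blocks(8, 4096) == 1
# And why the jmtpfs files show up as 0:
assert du_output_blocks(0, 1024) == 0
```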
jmtpfs does not support reporting blocks, because MTP hides filesystem details by design. It always sets st_blocks=0, which makes du unable to report disk usage in blocks.
ext4 uses block sizes of 4096. The block size can be set at mkfs time, but it shouldn’t exceed the computer’s page size. ext4 doesn’t support fragments, but it supports bigalloc mode (clusters), which is kind of the same thing in the opposite direction: fragments reduce block waste when you have many small files, while clusters reduce block waste when you have many big files. It also supports sparse files, and has a block mapping optimization called extents.
fdisk: SSDs report block sizes of 512, to be compatible with legacy controllers. This information may also be used by disk partitioners, although I didn’t explore that.
smartphone: I still need to back up the contents.