File sizes and disk usage

2020-05-06


I was backing up my old phone, running Android, so I could flash a new OS. I had deleted most of the stuff via the phone itself, but I still wanted to back up some files.

MTP, FUSE, and filesystem syscalls

I connected it to the laptop via USB and used jmtpfs to mount it:

root@laptop:/# jmtpfs /mnt/external
Device 0 (VID=22b8 and PID=2e82) is a Motorola Moto G (ID2).
Android device detected, assigning default bug flags
root@laptop:/# ls -lah /mnt/external/
total 4.0K
drwxr-xr-x  3 root root    0 Jan  1  1970  .
drwxr-xr-x  6 root root 4.0K Jun 16  2017  ..
drwxr-xr-x 16 root root    0 Jul 19  2018 'Internal storage'

Sidenote: 1 Jan 1970, nice.

Then I wanted to know how much space I would need on my laptop to back up that directory. Usually I’d run du -sh ., but I had a vague memory of that not working on these mounts. I tried it anyway:

root@laptop:/m/e/Internal storage# du -sh .
0       .

Why?

I know that I have files in here that take up disk space, but even running du on them directly doesn’t work:

root@laptop:/m/e/Internal storage# ls -lh Download/.com.google.Chrome.X6yFrG
-rw-r--r-- 1 root root 125K Jul  3  2016 Download/.com.google.Chrome.X6yFrG

root@laptop:/m/e/Internal storage# du -h Download/.com.google.Chrome.X6yFrG
0       Download/.com.google.Chrome.X6yFrG

Does du use a different syscall? Or maybe it looks at other properties? Let’s see what happens with strace.

root@laptop:/m/e/Internal storage# strace du -h Download/.com.google.Chrome.X6yFrG  > /dev/null
execve("/usr/bin/du", ["du", "-h", "Download/.com.google.Chrome.X6yF"...], 0x7fff26d27750 /* 22 vars */) = 0
brk(NULL)                               = 0x5611069c6000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=112615, ...}) = 0
mmap(NULL, 112615, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7cd07db000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 o\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1831600, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7cd07d9000
mmap(NULL, 1844568, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f7cd0616000
mmap(0x7f7cd063b000, 1351680, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f7cd063b000
mmap(0x7f7cd0785000, 303104, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16f000) = 0x7f7cd0785000
mmap(0x7f7cd07cf000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x7f7cd07cf000
mmap(0x7f7cd07d5000, 13656, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f7cd07d5000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7f7cd07da580) = 0
mprotect(0x7f7cd07cf000, 12288, PROT_READ) = 0
mprotect(0x5611055bd000, 4096, PROT_READ) = 0
mprotect(0x7f7cd081f000, 4096, PROT_READ) = 0
munmap(0x7f7cd07db000, 112615)          = 0
brk(NULL)                               = 0x5611069c6000
brk(0x5611069e7000)                     = 0x5611069e7000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=3044032, ...}) = 0
mmap(NULL, 3044032, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7cd032e000
close(3)                                = 0
newfstatat(AT_FDCWD, "Download/.com.google.Chrome.X6yFrG", {st_mode=S_IFREG|0644, st_size=127410, ...}, AT_SYMLINK_NOFOLLOW) = 0
fstat(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}) = 0
ioctl(1, TCGETS, 0x7ffef972ddd0)        = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "0\tDownload/.com.google.Chrome.X6"..., 37) = 37
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

So du uses newfstatat, which, according to man stat:

The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().

What does ls use?

root@laptop:/m/e/Internal storage# strace ls -lh Download/.com.google.Chrome.X6yFrG 2>&1  > /dev/null | wc -l
193

Uff, this is a bit long. Let me filter it down:

root@laptop:/m/e/Internal storage# strace ls -lh Download/.com.google.Chrome.X6yFrG 2>&1 > /dev/null | grep Download/
execve("/bin/ls", ["ls", "-lh", "Download/.com.google.Chrome.X6yF"...], 0x7fff2c1c8b90 /* 22 vars */) = 0
lstat("Download/.com.google.Chrome.X6yFrG", {st_mode=S_IFREG|0644, st_size=127410, ...}) = 0
lgetxattr("Download/.com.google.Chrome.X6yFrG", "security.selinux", 0x5630117d9f60, 255) = -1 EOPNOTSUPP (Operation not supported)
getxattr("Download/.com.google.Chrome.X6yFrG", "system.posix_acl_access", NULL, 0) = -1 EOPNOTSUPP (Operation not supported)

Seems to use lstat. Hm, let’s take a look at them side by side:

newfstatat(AT_FDCWD, "Download/.com.google.Chrome.X6yFrG", {st_mode=S_IFREG|0644, st_size=127410, ...}, AT_SYMLINK_NOFOLLOW) = 0
lstat("Download/.com.google.Chrome.X6yFrG", {st_mode=S_IFREG|0644, st_size=127410, ...}) = 0

They both have st_size=127410, so it doesn’t seem to be a difference in syscall behavior; du must be looking at another field of that structure. My hypothesis is that it has something to do with blocks vs bytes. I’d like to print the full structure filled by the syscall. Does strace have an option for that? Maybe stat prints something useful?
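Sidenote: the same fields are easy to poke at from a scripting language. Python’s os.stat and os.lstat expose st_size and st_blocks directly; a quick sketch on a throwaway local file (nothing MTP specific here):

```python
import os
import tempfile

# throwaway 1000-byte file; both stat() and lstat() fill the same struct
# for a regular (non-symlink) file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1000)
    path = f.name

st, lst = os.stat(path), os.lstat(path)
print(st.st_size, st.st_blocks)   # byte size vs 512-byte block count
os.unlink(path)
```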

root@laptop:/m/e/Internal storage# stat Download/.com.google.Chrome.X6yFrG
  File: Download/.com.google.Chrome.X6yFrG
  Size: 127410          Blocks: 0          IO Block: 4096   regular file
Device: 34h/52d Inode: 31          Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 1970-01-01 01:00:00.000000000 +0100
Modify: 2016-07-03 16:33:05.000000000 +0100
Change: 1970-01-01 01:00:00.000000000 +0100
 Birth: -

Ah, Size=127410 and Blocks=0. This could be it. Let me try with a file outside /mnt/external.

root@laptop:/m/e/Internal storage# ls -l ~/.bashrc
-rw-r--r-- 1 root root 646 Aug  1  2016 /root/.bashrc

root@laptop:/m/e/Internal storage# du -h ~/.bashrc
4.0K    /root/.bashrc

root@laptop:/m/e/Internal storage# stat ~/.bashrc
  File: /root/.bashrc
  Size: 646             Blocks: 8          IO Block: 4096   regular file
Device: fe01h/65025d    Inode: 3670019     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-05-06 14:47:33.290946241 +0100
Modify: 2016-08-01 16:51:58.622367982 +0100
Change: 2016-08-01 16:51:58.622367982 +0100
 Birth: -

So we’ve got 646 from ls, which matches stat’s Size. du prints 4.0K, which... could be 512 * 8, I guess. Let me read man du.

Display values are in units of the first available SIZE from --block-size, and the DU_BLOCK_SIZE, BLOCK_SIZE and BLOCKSIZE environment variables. Otherwise, units default to 1024 bytes (or 512 if POSIXLY_CORRECT is set).

The SIZE argument is an integer and optional unit (example: 10K is 10*1024). Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,… (powers of 1000).

Meh, this is confusing. Let me check the source of du. GNU coreutils has a GitHub mirror:

https://github.com/coreutils/coreutils/blob/master/src/du.c

1kloc, not bad. I wish I could look at a musl equivalent of coreutils.

/* If true, rather than using the disk usage of each file,
   use the apparent size (a la stat.st_size).  */
static bool apparent_size = false;

I saw something related to “apparent size” in man du, so I’m guessing that if I use that flag I’ll probably get du to print the same as ls, but I want to figure out what it prints by default first. Where is this boolean used?

  duinfo_set (&dui,
              (apparent_size
               ? MAX (0, sb->st_size)
               : (uintmax_t) ST_NBLOCKS (*sb) * ST_NBLOCKSIZE),
              (time_type == time_mtime ? get_stat_mtime (sb)
               : time_type == time_atime ? get_stat_atime (sb)
               : get_stat_ctime (sb)));

OK, so we have st_size for apparent size, and ST_NBLOCKS(*sb) * ST_NBLOCKSIZE by default. Googling for ST_NBLOCKS got me a Stack Exchange Q&A:

https://unix.stackexchange.com/questions/521151/why-is-st-blocks-always-reported-in-512-byte-blocks

Why is st_blocks always reported in 512-byte blocks? I was debugging a fuse filesystem that was reporting wrong sizes for du. It turned out that it was putting st_size / st_blksize [*] into st_blocks of the stat structure. The Linux manual page for stat(2) says:

The size of a block is implementation-specific. On Linux it’s always 512 bytes, for historical reasons; in particular, it used to be the typical size of a disk sector.

This seems to be a common problem.

So this seems to match what I thought: it’s 8 * 512. Out of curiosity, let me try to find ST_NBLOCKSIZE.

Found something similar:

hugopeixoto@laptop:~$ ack S_BLKSIZE /usr/include/
/usr/include/x86_64-linux-gnu/sys/stat.h
199:# define S_BLKSIZE  512     /* Block size for `st_blocks'.  */

From the internets, ST_NBLOCKSIZE seems to be #defined as S_BLKSIZE, so this is probably it.

https://github.com/rofl0r/gnulib/blob/master/lib/stat-size.h#L105

And ST_NBLOCKS is defined as:

#    define ST_NBLOCKS(statbuf) \
  (S_ISREG ((statbuf).st_mode) || S_ISDIR ((statbuf).st_mode) \
     ? (statbuf).st_blocks * ST_BLKSIZE (statbuf) / ST_NBLOCKSIZE : 0)

And ST_BLKSIZE (not to be confused with S_BLKSIZE) is, most of the time, st_blksize. So du calculates disk usage as:

ST_NBLOCKS(*sb) * ST_NBLOCKSIZE =
(*sb).st_blocks * ST_BLKSIZE(*sb) / ST_NBLOCKSIZE * ST_NBLOCKSIZE =
sb->st_blocks * sb->st_blksize / S_BLKSIZE * S_BLKSIZE =
sb->st_blocks * sb->st_blksize / 512 * 512 =
sb->st_blocks * sb->st_blksize

Sidenote: st_blksize is a multiple of 512, that’s why we’re able to simplify it by removing / 512 * 512.

So, through a bunch of indirections, this ends up using the block count reported directly by the filesystem.
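A single-file version of that logic, as a sketch (the real du also recurses and deduplicates hard links):

```python
import os
import tempfile

def du_bytes(path, apparent_size=False):
    # single-file sketch of du's size computation; 512 is ST_NBLOCKSIZE
    sb = os.lstat(path)
    if apparent_size:
        return max(0, sb.st_size)     # the MAX (0, sb->st_size) branch from du.c
    return sb.st_blocks * 512         # ST_NBLOCKS (*sb) * ST_NBLOCKSIZE

# demo on a throwaway 646-byte file, mirroring the ~/.bashrc example above
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"y" * 646)
    demo = f.name

apparent = du_bytes(demo, apparent_size=True)   # 646
on_disk = du_bytes(demo)                        # a whole number of 512-byte blocks
os.unlink(demo)
```

On the MTP mount, the default branch returns 0 for every file, since jmtpfs leaves st_blocks at zero.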

Let me try the POSIXLY_CORRECT thing:

hugopeixoto@laptop:~$ du history/bash
4992    history/bash
hugopeixoto@laptop:~$ POSIXLY_CORRECT=1 du history/bash
9984    history/bash

OK, that checks out. The reported number doubles, because the unit is 512 bytes instead of 1024.
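The -h formatting seems to round the byte total up, with one decimal place below 10 units and integers above, in powers of 1024. A rough approximation (my reading of the outputs, not the actual coreutils human_readable code):

```python
import math

def human(n, unit=1024):
    # round up; one decimal place below 10 units, integer above
    suffixes = ["", "K", "M", "G", "T"]
    i = 0
    v = float(n)
    while v >= unit and i < len(suffixes) - 1:
        v /= unit
        i += 1
    if i == 0:
        return str(n)
    if v < 10:
        return f"{math.ceil(v * 10) / 10:.1f}{suffixes[i]}"
    return f"{math.ceil(v)}{suffixes[i]}"
```

It reproduces the values seen so far: human(4096) gives 4.0K, human(127410) gives 125K.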

Back to the apparent size thing:

root@laptop:/m/e/Internal storage# du -h --apparent-size Download/.com.google.Chrome.X6yFrG
125K    Download/.com.google.Chrome.X6yFrG
root@laptop:/m/e/Internal storage# du --apparent-size -sh .
805M    .

Seems OK. Now, why does stat return Blocks=0? This should be related to jmtpfs. man jmtpfs:

jmtpfs is a FUSE and libmtp based filesystem for accessing MTP (Media Transfer Protocol) devices. It was specifically designed for exchanging files between Linux (and Mac OS X) systems and newer Android devices that support MTP but not USB Mass Storage.

So, my current guess is that libmtp does not return block information and the FUSE layer leaves st_blocks at zero. Let’s check by downloading the source of jmtpfs and grepping for it.

hugopeixoto@laptop:~/w/contrib$ apt source jmtpfs
Reading package lists... Done
NOTICE: 'jmtpfs' packaging is maintained in the 'Git' version control system at:
git://anonscm.debian.org/collab-maint/jmtpfs.git
Please use:
git clone git://anonscm.debian.org/collab-maint/jmtpfs.git
to retrieve the latest (possibly unreleased) updates to the package.
Need to get 148 kB of source archives.
Get:1 http://ftp.pt.debian.org/debian testing/main jmtpfs 0.5-2 (dsc) [1,874 B]
Get:2 http://ftp.pt.debian.org/debian testing/main jmtpfs 0.5-2 (tar) [143 kB]
Get:3 http://ftp.pt.debian.org/debian testing/main jmtpfs 0.5-2 (diff) [3,321 B]
Fetched 148 kB in 0s (473 kB/s)
dpkg-source: info: extracting jmtpfs in jmtpfs-0.5
dpkg-source: info: unpacking jmtpfs_0.5.orig.tar.gz
dpkg-source: info: unpacking jmtpfs_0.5-2.debian.tar.gz
hugopeixoto@laptop:~/w/contrib$ ack st_nblocks jmtpfs-0.5/src/
hugopeixoto@laptop:~/w/contrib$

No luck on the first try. I’ll browse the source a bit. Oh, maybe I didn’t find any matches because it isn’t set at all.

$ ack block jmtpfs-0.5/
jmtpfs-0.5/src/MtpNode.cpp
131:    stat->f_bsize = 512;  // We have to pick some block size, so why not 512?
132:    stat->f_blocks = storageInfo.maxCapacity / stat->f_bsize;

jmtpfs-0.5/src/MtpRoot.cpp
111:    stat->f_bsize = 512;  // We have to pick some block size, so why not 512?
112:    stat->f_blocks = totalSize / stat->f_bsize;

Hm, f_blocks? What’s the difference between an f_* and a st_*?

So, these f_* fields belong to struct statvfs: per-filesystem stats, which is what FUSE returns for statfs. It sets f_bsize to 512 and derives f_blocks from the device capacity, but none of that touches the per-file st_blocks. Let’s look for st_size instead.

hugopeixoto@laptop:~/w/c/j/src$ ack st_size
MtpFile.cpp
55:             info.st_size = localFile->getSize();
58:             info.st_size = md.self.filesize;
103:    if (info.st_size == length)

OK, so we do have a st_size being set. And no st_blocks. What’s MTP anyway? Does it report blocks? (probably not)

https://en.wikipedia.org/wiki/Media_Transfer_Protocol

The Media Transfer Protocol (MTP) is an extension to the Picture Transfer Protocol (PTP) communications protocol that allows media files to be transferred atomically to and from portable devices.

MTP is a key part of WMDRM10-PD,[1] a digital rights management (DRM) service for the Windows Media platform.

In 2011, it became the standard method to transfer files from/to Android

The USB Implementers Forum device working group standardised MTP as a full-fledged Universal Serial Bus (USB) device class in May 2008.[4] Since then MTP is an official extension to PTP and shares the same class code.[5]

Cool?

Comparison with USB Mass Storage

Ah, this sounds promising.

File oriented instead of block oriented protocol: By not exposing the filesystem and metadata index, the integrity of these is in full control of the device.

Well, there you go.

None of my devices support USB Mass Storage. And from this SO answer:

https://android.stackexchange.com/questions/190138/how-to-use-usb-mass-storage-mode-on-android-4-3

If device implementations have a USB port with USB peripheral mode support, they:

MAY use USB mass storage, but SHOULD use Media Transfer Protocol to satisfy this requirement.

It seems like this is by design.

Now, back to strace. Is there a way to print the full structure? man strace, --no-abbrev seems like a good candidate:

--no-abbrev

Print unabbreviated versions of environment, stat, termios, etc. calls.  These
structures are very common in calls and so the default behavior displays a
reasonable subset of structure members.  Use this option to get all of the gory
details.

Let’s give it a go:

root@laptop:/m/e/Internal storage# strace --no-abbrev -- du -h Download/.com.google.Chrome.X6yFrG 2>&1 > /dev/null | grep newfstata
newfstatat(AT_FDCWD, "Download/.com.google.Chrome.X6yFrG", {
  st_dev=makedev(0, 0x34),
  st_ino=4,
  st_mode=S_IFREG|0644,
  st_nlink=1,
  st_uid=0,
  st_gid=0,
  st_blksize=4096,
  st_blocks=0,
  st_size=127410,
  st_atime=0,
  st_atime_nsec=0,
  st_mtime=1467559985 /* 2016-07-03T16:33:05+0100 */,
  st_mtime_nsec=0,
  st_ctime=0,
  st_ctime_nsec=0
}, AT_SYMLINK_NOFOLLOW) = 0

There it is (indented for readability); strace -v seems to do the same. Not only is st_blocks zero, some timestamp related fields are also at zero. Here’s the same for a non-MTP file:

root@laptop:/m/e/Internal storage# strace -v -- du -h ~/.bashrc 2>&1 > /dev/null | grep newfstata
newfstatat(AT_FDCWD, "/root/.bashrc", {
  st_dev=makedev(0xfe, 0x1),
  st_ino=3670019,
  st_mode=S_IFREG|0644,
  st_nlink=1,
  st_uid=0,
  st_gid=0,
  st_blksize=4096,
  st_blocks=8,
  st_size=646,
  st_atime=1588772853 /* 2020-05-06T14:47:33.290946241+0100 */,
  st_atime_nsec=290946241,
  st_mtime=1470066718 /* 2016-08-01T16:51:58.622367982+0100 */,
  st_mtime_nsec=622367982,
  st_ctime=1470066718 /* 2016-08-01T16:51:58.622367982+0100 */,
  st_ctime_nsec=622367982
}, AT_SYMLINK_NOFOLLOW) = 0

Curious that there’s an st_blksize that is not related to st_blocks. As explained in one of the Stack Exchange links above:

It indicates the preferred size for I/O, i.e. the amount of data that should be transferred in one operation for optimal results (ignoring other layers in the I/O stack).

I wonder what the purpose of st_blocks is. It can’t be just st_size / 512. That would be kind of pointless, right?

Found a thread from 1992, but it doesn’t help much.

https://groups.google.com/forum/#!topic/comp.unix.programmer/7saTJ9gRBEM

This could be related to sparse files.

ext4 and sparse files

Let me look that up and try some cli shenanigans. From the Arch wiki:

https://wiki.archlinux.org/index.php/sparse_file

Creating sparse files: The truncate utility can create sparse files. This command creates a 512 MiB sparse file:

$ truncate -s 512M file.img

hugopeixoto@laptop:~/w/c/j/src$ truncate -s 512M file.img
hugopeixoto@laptop:~/w/c/j/src$ du -sh --apparent-size file.img
512M    file.img
hugopeixoto@laptop:~/w/c/j/src$ du -sh file.img
0       file.img
hugopeixoto@laptop:~/w/c/j/src$ stat file.img
  File: file.img
  Size: 536870912       Blocks: 0          IO Block: 4096   regular file
Device: fe01h/65025d    Inode: 2904608     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/hugopeixoto)   Gid: ( 1000/hugopeixoto)
Access: 2020-05-06 17:11:22.171650277 +0100
Modify: 2020-05-06 17:11:22.171650277 +0100
Change: 2020-05-06 17:11:22.171650277 +0100
 Birth: -

Hm, so the size is indeed different from Blocks * 512. This feels like filesystem specific stuff, so let’s dig into ext4 directly.
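For the record, the same experiment reproduces from Python via ftruncate (a sketch; exact block counts depend on the filesystem):

```python
import os
import tempfile

# the equivalent of `truncate -s 512M file.img`
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(512 * 1024 * 1024)
    img = f.name

sb = os.stat(img)
print(sb.st_size, sb.st_blocks)   # huge apparent size, (almost) no blocks
os.unlink(img)
```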

I have the file’s inode thanks to stat: 2904608. I’m using ext4, so I can use the debugfs tool, from apt install e2fsprogs. This allows me to run various debug commands.

root@laptop:/# debugfs -R 'imap <2904608>' /dev/mapper/laptop--vg-root
debugfs 1.45.6 (20-Mar-2020)
Inode 2904608 is part of block group 354
        located at block 11535681, offset 0x0f00

Can I read this block? I need the block size, to know where to seek, and the inode size, to know how much to read.

root@laptop:/# debugfs -R 'stats' /dev/mapper/laptop--vg-root | grep size | head -n9
debugfs 1.45.6 (20-Mar-2020)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Block size:               4096
Fragment size:            4096
Flex block group size:    16
Inode size:               256
Required extra isize:     28
Desired extra isize:      28

So... I should seek to 11535681 * 4096 + 0x0f00 = 47250153216 and read 256 bytes? 0xf00 is 15 * 256 = 3840, so that’s inside the 4096-byte block; it’s the last inode in this block, apparently. It’s kind of a large offset, though. Tried reading this with dd, didn’t work.
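Spelling out the offset arithmetic, which is where the number I’m feeding to dd comes from:

```python
# inode location = block number * block size + offset within the block
block_number = 11535681    # from debugfs imap
block_size = 4096          # from debugfs stats
offset_in_block = 0x0F00   # 15 * 256 = 3840, inside the 4096-byte block

inode_offset = block_number * block_size + offset_in_block
print(inode_offset)        # 47250153216
```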

root@laptop:/# dd if=/dev/mapper/laptop--vg-root bs=1 seek=47250153216 count=256
dd: 'standard output': cannot seek: Illegal seek
0+0 records in
0+0 records out
0 bytes copied, 0.000325579 s, 0.0 kB/s

I hope I don’t accidentally my main partition. This could have to do with the fact that I’m using an LVM encrypted volume. Maybe seeks are not allowed.

Crap. I should use skip when reading, not seek; seek positions the output file. I hope I didn’t jumble anything.
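The input-side positioning that skip= does, as a Python sketch (reading 4 bytes at a known offset of a throwaway file):

```python
import os
import tempfile

# build a small file with known content at offset 1000
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 1000 + b"data" + b"\x00" * 100)
    path = f.name

# the equivalent of `dd if=FILE bs=1 skip=1000 count=4`
with open(path, "rb") as f:
    f.seek(1000)       # position the read side, like skip=1000
    chunk = f.read(4)
os.unlink(path)
```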

root@laptop:/# dd if=/dev/mapper/laptop--vg-root bs=1 skip=47250153216 count=256 | xxd
256+0 records in
256+0 records out
256 bytes copied, 0.00061617 s, 415 kB/s
00000000: a481 e803 0000 0020 aae1 b25e aae1 b25e  ....... ...^...^
00000010: aae1 b25e 0000 0000 e803 0100 0000 0000  ...^............
00000020: 0000 0800 0100 0000 0af3 0000 0400 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 64e2 2b6f 0000 0000 0000 0000  ....d.+o........
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 2000 0000 94b3 ec28 94b3 ec28 94b3 ec28   ......(...(...(
00000090: aae1 b25e 94b3 ec28 0000 0000 0000 0000  ...^...(........
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

So this is my inode. Cool. Let me try to write a single byte at pos=0 in the sparse file and try again (checking if the inode changes before running dd again).

hugopeixoto@laptop:~/w/c/j/src$ echo -n "*" | dd of=file.img bs=1 seek=0 count=1
1+0 records in
1+0 records out
1 byte copied, 0.000335263 s, 3.0 kB/s

hugopeixoto@laptop:~/w/c/j/src$ stat file.img | grep Inode
Device: fe01h/65025d    Inode: 2904608     Links: 1

root@laptop:/# dd if=/dev/mapper/laptop--vg-root bs=1 skip=47250153216 count=256 | xxd
256+0 records in
256+0 records out
256 bytes copied, 0.00263862 s, 97.0 kB/s
00000000: a481 e803 0100 0000 aae1 b25e e7f9 b25e  ...........^...^
00000010: e7f9 b25e 0000 0000 e803 0100 0800 0000  ...^............
00000020: 0000 0800 0100 0000 0af3 0100 0400 0000  ................
00000030: 0000 0000 0000 0000 0100 0000 bc6d 5a00  .............mZ.
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 64e2 2b6f 0000 0000 0000 0000  ....d.+o........
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 2000 0000 00b1 c3df 00b1 c3df 94b3 ec28   ..............(
00000090: aae1 b25e 94b3 ec28 0000 0000 0000 0000  ...^...(........
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

OK, inode stayed the same. I don’t see the * in here, which makes sense. File contents are not inlined.

What’s the difference between these two? I need to know what these fields mean first.

https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Inode_Table

Let’s see if I can decode the timestamps.

Access time, offset 0x8, size=0x4

hugopeixoto@laptop:~$ sudo dd status=none if=/dev/mapper/laptop--vg-root bs=1 skip=47250153216 count=256 |
> ruby -e "puts Time.at(STDIN.read[0x8...(0x8+4)].unpack('V')[0])"
2020-05-06 17:11:22 +0100

Sidenote: https://ruby-doc.org/core-2.7.1/String.html#method-i-unpack

Looks sane. OK, now that I know I’m reading the right thing, let’s see what else is in this table.

| 0x1C | __le32 | i_blocks_lo | Lower 32-bits of “block” count. If the huge_file feature flag is not set on the filesystem, the file consumes i_blocks_lo 512-byte blocks on disk.

hugopeixoto@laptop:~$ sudo dd status=none if=/dev/mapper/laptop--vg-root bs=1 skip=47250153216 count=256 |
> ruby -e "puts STDIN.read[0x1C...(0x1C+4)].unpack('V')"
8

So this is currently taking 8 blocks? What does stat say again?

hugopeixoto@laptop:~/w/c/j/src$ stat file.img | grep Blocks
  Size: 1               Blocks: 8          IO Block: 4096   regular file

OK, so stat reads what’s in this table. The 512 thing is referenced in the ext4 docs again.

What if I now write a byte to the end of the file? What changes?

Oh no, I just noticed that when I wrote to the first byte, dd truncated the file down to that single byte:

hugopeixoto@laptop:~/w/c/j/src$ du --apparent-size file.img
1       file.img

That’s fine. The inode exploration still stands. I should have used conv=notrunc:

hugopeixoto@laptop:~$ echo -n "*" | dd of=file.img bs=1 seek=0 count=1 conv=sparse conv=notrunc
hugopeixoto@laptop:~$ stat file.img
  File: file.img
  Size: 536870912       Blocks: 8          IO Block: 4096   regular file
Device: fe01h/65025d    Inode: 2901111     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/hugopeixoto)   Gid: ( 1000/hugopeixoto)
Access: 2020-05-06 20:40:26.700433373 +0100
Modify: 2020-05-06 20:40:29.600453655 +0100
Change: 2020-05-06 20:40:29.600453655 +0100
 Birth: -

OK, ready to go. New offsets: inode=2901111, offset=47249257984

Writing last byte:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
hugopeixoto@laptop:~$ echo -n "*" | dd of=file.img bs=1 seek=536870911 count=1 conv=notrunc
hugopeixoto@laptop:~$ stat file.img
  File: file.img
  Size: 536870912       Blocks: 16         IO Block: 4096   regular file
Device: fe01h/65025d    Inode: 2901111     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/hugopeixoto)   Gid: ( 1000/hugopeixoto)
Access: 2020-05-06 20:40:26.700433373 +0100
Modify: 2020-05-06 20:43:57.649962142 +0100
Change: 2020-05-06 20:43:57.649962142 +0100
 Birth: -
hugopeixoto@laptop:~$ du -h --apparent-size file.img
512M    file.img
hugopeixoto@laptop:~$ du -h file.img
8.0K    file.img

Size stays at 512M, but disk usage is low. How does the file system handle this? Let’s take a look at the inode entry again (since I had to recreate the file anyway):

hugopeixoto@laptop:~$ sudo dd status=none if=/dev/mapper/laptop--vg-root bs=1 skip=47249257984 count=256 | xxd
00000000: a481 e803 0000 0020 aa12 b35e 7d13 b35e  ....... ...^}..^
00000010: 7d13 b35e 0000 0000 e803 0100 1000 0000  }..^............
00000020: 0000 0800 0100 0000 0af3 0200 0400 0000  ................
00000030: 0000 0000 0000 0000 0100 0000 0038 ab02  .............8..
00000040: ffff 0100 0100 0000 0000 ad02 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 fc40 b686 0000 0000 0000 0000  .....@..........
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 2000 0000 788a f69a 788a f69a 740f ffa6   ...x...x...t...
00000090: aa12 b35e 740f ffa6 0000 0000 0000 0000  ...^t...........
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

There’s a flags field, let’s see what’s up there. Also, I need an alias to get the block info.

hugopeixoto@laptop:~$ alias getblock="sudo dd status=none if=/dev/mapper/laptop--vg-root bs=1 skip=47249257984 count=256"
hugopeixoto@laptop:~$ getblock | ruby -e "puts STDIN.read[0x20..(0x20+4)].unpack('V')[0].to_s(16)"
80000

0x80000 means that the inode uses extents:

In ext4, the file to logical block map has been replaced with an extent tree. Under the old scheme, allocating a contiguous run of 1,000 blocks requires an indirect block to map all 1,000 entries; with extents, the mapping is reduced to a single struct ext4_extent with ee_len = 1000.

So, what I’m getting from this is that file contents are in blocks, and there’s a map somewhere to point file block 1 to disk block X, file block 2 to disk block Y, etc. Before there had to be an entry for every file block. Now, if contiguous file blocks are in contiguous disk blocks, this can be compressed using extents. Let’s see where we can find that information for file.img. Where do we start?

| 0x28 | 60 bytes | i_block[EXT4_N_BLOCKS=15] | Block map or extent tree. See the section “The Contents of inode.i_block”.

hugopeixoto@laptop:~$ getblock | ruby -e "puts STDIN.read[0x28..(0x28+60)]" | xxd
00000000: 0af3 0200 0400 0000 0000 0000 0000 0000  ................
00000010: 0100 0000 0038 ab02 ffff 0100 0100 0000  .....8..........
00000020: 0000 ad02 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 fc0a       ..............

There’s a 0af3, which matches the documentation (considering endianness):

0x0 __le16 eh_magic Magic number, 0xF30A.

Let’s extract the full header:

hugopeixoto@laptop:~$ getblock | ruby -e "puts STDIN.read[0x28..(0x28+60)].unpack('H4vvvV')"
0af3
2
4
0
0

Magic number, check. 2 entries, maximum of 4, depth level 0. The depth level is important:

Depth of this extent node in the extent tree. 0 = this extent node points to data blocks

So, this could be 2 entries pointing to one data block each: one for the beginning of the sparse file and one for the end, maybe? If each data block is 4096 bytes, that would be the 8.0K of disk usage du reported. These two entries point directly to data blocks, so they’re of the type struct ext4_extent:

| 0x0 | __le32 | ee_block    | First file block number that this extent covers.
| 0x4 | __le16 | ee_len      | Number of blocks covered by extent. [...]
| 0x6 | __le16 | ee_start_hi | Upper 16-bits of the block number to which this extent points.
| 0x8 | __le32 | ee_start_lo | Lower 32-bits of the block number to which this extent points.
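As a toy model (field names from the table above, example numbers mine): resolving a file block through an extent list is a range lookup, and anything not covered by an extent is a hole. This is how a 1,000-block contiguous run collapses into a single entry:

```python
# one extent covering file blocks 0..999, stored at disk blocks 8000..8999
# (example numbers, not from the real filesystem)
extents = [
    (0, 1000, 8000),   # (ee_block, ee_len, ee_start)
]

def disk_block(file_block):
    for ee_block, ee_len, ee_start in extents:
        if ee_block <= file_block < ee_block + ee_len:
            return ee_start + (file_block - ee_block)
    return None        # not covered by any extent: a hole in a sparse file

print(disk_block(0), disk_block(999), disk_block(1000))
```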

Let’s get their info:

hugopeixoto@laptop:~$ getblock | ruby -e "puts STDIN.read[0x28..(0x28+60)][0xC...(0xC+12)].unpack('VvvV')"
0
1
0
44775424
hugopeixoto@laptop:~$ getblock | ruby -e "puts STDIN.read[0x28..(0x28+60)][0xC+12...(0xC+12+12)].unpack('VvvV')"
131071
1
0
44892160

So the first extent covers file block 0, mapped to disk block 44775424. The second extent covers file block 131071, the last block of the 512 MiB file (512 MiB / 4096 = 131072 blocks), mapped to disk block 44892160. Cute!
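The same decode can be done with Python’s struct module, on the i_block bytes transcribed from the inode hexdump above:

```python
import struct

# first 36 bytes of i_block: a 12-byte ext4_extent_header
# followed by two 12-byte ext4_extent entries (all little-endian)
i_block = bytes.fromhex(
    "0af3 0200 0400 0000 0000 0000"   # header: magic, entries, max, depth, generation
    "0000 0000 0100 0000 0038 ab02"   # extent 0
    "ffff 0100 0100 0000 0000 ad02"   # extent 1
)

magic, entries, max_entries, depth, generation = struct.unpack_from("<4HI", i_block, 0)

ext = []
for i in range(entries):
    ee_block, ee_len, hi, lo = struct.unpack_from("<IHHI", i_block, 12 + 12 * i)
    ext.append((ee_block, ee_len, (hi << 32) | lo))  # 48-bit physical block number

print(hex(magic), depth, ext)
```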

Let’s try to read the data blocks:

hugopeixoto@laptop:~$ sudo dd status=none if=/dev/mapper/laptop--vg-root bs=4096 skip=44775424 count=1 | xxd | head -n3
00000000: 2a00 0000 0000 0000 0000 0000 0000 0000  *...............
00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
hugopeixoto@laptop:~$ sudo dd status=none if=/dev/mapper/laptop--vg-root bs=4096 skip=44892160 count=1 | xxd | tail -n3
00000fd0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000fe0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000ff0: 0000 0000 0000 0000 0000 0000 0000 002a  ...............*

There are my asterisks. Notice head vs tail.

Back to the 512 byte block size. What’s up with that? Apparently, according to the internet, it matches the physical sector size of older hard drives.

https://askubuntu.com/questions/1144535/is-the-default-512-byte-physical-sector-size-appropriate-for-ssd-disks-under-lin

Nowadays disks use 4096 byte blocks, but to keep everything compatible, we still use 512?

hugopeixoto@laptop:~$ stat -f /dev/nvme0
  File: "/dev/nvme0"
    ID: 0        Namelen: 255     Type: tmpfs
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 996251     Free: 996251     Available: 996251
Inodes: Total: 996251     Free: 995747

hugopeixoto@laptop:~$ sudo fdisk -l /dev/nvme0n1p3
Disk /dev/nvme0n1p3: 237.75 GiB, 255266390016 bytes, 498567168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

So stat -f reports block sizes of 4096, matching the IO Block that shows up when statting a file (though note the Type: tmpfs; stat -f describes the filesystem containing /dev, not the disk itself). fdisking the partition, though, displays 512 all over the place.

I wonder what kind of syscalls, or metadata, all of these come from.

fdisk: units, sector sizes and I/O sizes

root@laptop:~# strace -v fdisk -l /dev/nvme0n1 2>&1 >/dev/null| grep 512
ioctl(3, BLKIOMIN, [512])               = 0
ioctl(3, BLKIOOPT, [512])               = 0
ioctl(3, BLKPBSZGET, [512])             = 0
ioctl(3, BLKSSZGET, [512])              = 0
ioctl(3, BLKSSZGET, [512])              = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
lseek(3, 512, SEEK_SET)                 = 512
read(3, "EFI PART\0\0\1\0\\\0\0\0\300W\215\226\0\0\0\0\1\0\0\0\0\0\0\0"..., 512) = 512
read(3, "EFI PART\0\0\1\0\\\0\0\0\16\372\2731\0\0\0\0\2572\317\35\0\0\0\0"..., 512) = 512

BLKIOMIN and BLKIOOPT are probably I/O size (minimum/optimal)

BLKSSZGET and BLKPBSZGET are Sector size (logical/physical)

Unsure about that Units: sectors of 1 * 512.

Sidenote: I just learned of strace -s N, which increases the maximum string size before strings get truncated with an ellipsis.

I’ll check fdisk’s source; with every value being 512, it’s a bit hard to tell where Units comes from.

From a random mirror:

https://github.com/karelzak/util-linux/blob/master/disk-utils/fdisk-list.c#L76

fdisk_info(cxt, _("Units: %s of %d * %ld = %ld bytes"),
  fdisk_get_unit(cxt, FDISK_PLURAL),
  fdisk_get_units_per_sector(cxt),
  fdisk_get_sector_size(cxt),
  fdisk_get_units_per_sector(cxt) * fdisk_get_sector_size(cxt));

fdisk_info(cxt, _("Sector size (logical/physical): %lu bytes / %lu bytes"),
  fdisk_get_sector_size(cxt),
  fdisk_get_physector_size(cxt));

fdisk_info(cxt, _("I/O size (minimum/optimal): %lu bytes / %lu bytes"),
  fdisk_get_minimal_iosize(cxt),
  fdisk_get_optimal_iosize(cxt));

It seems to use sector_size (ioctl BLKSSZGET) in Units as well. Trying to figure out where the actual ioctl is called. Currently going through topology.c and topology/ioctl.c.

Tracing it to probe.c:blkid_probe_get_sectorsize. Onto lib/blkdev.c:blkdev_get_sector_size, and here it is. Unsure why this is so different from the other three, which are defined in topology/ioctl.c. Saving syscalls, maybe.

NVME device drivers

ioctl calls are usually handled by the device driver. I suppose my SSD requires a device driver, but I have no idea which one it is.

https://unix.stackexchange.com/questions/248494/how-to-find-the-driver-module-associated-with-sata-device-on-linux

Use udevadm info as described in the other answer to the link you mentioned

I know that udev handles devices and hotplug, so I guess it makes sense that it can list the device drivers.

root@laptop:~# udevadm info -a -n /dev/nvme0 | egrep looking\|DRIVER
  looking at device '/devices/pci0000:00/0000:00:1d.0/0000:3c:00.0/nvme/nvme0':
    DRIVER==""
  looking at parent device '/devices/pci0000:00/0000:00:1d.0/0000:3c:00.0':
    DRIVERS=="nvme"
  looking at parent device '/devices/pci0000:00/0000:00:1d.0':
    DRIVERS=="pcieport"
  looking at parent device '/devices/pci0000:00':
    DRIVERS==""

I suppose that the relevant part is the nvme.

https://github.com/torvalds/linux/tree/master/drivers/nvme

A host and a target directory. What are those? They both seem small. Can I clone just this directory? I don’t want to clone the full kernel tree. Meh, maybe it’s fast. Trying a shallow clone.

hugopeixoto@laptop:~/w/contrib$ git clone https://github.com/torvalds/linux --depth 1 --no-checkout
Cloning into 'linux'...
remote: Enumerating objects: 72113, done.
remote: Counting objects: 100% (72113/72113), done.
remote: Compressing objects: 100% (67592/67592), done.
remote: Total 72113 (delta 5341), reused 24234 (delta 3791), pack-reused 0
Receiving objects: 100% (72113/72113), 190.56 MiB | 6.87 MiB/s, done.
Resolving deltas: 100% (5341/5341), done.
hugopeixoto@laptop:~/w/c/linux$ git checkout master

A minute or so, not bad. Found this:

// in block/ioctl.c:
/*
 * Common commands that are handled the same way on native and compat
 * user space. Note the separate arg/argp parameters that are needed
 * to deal with the compat_ptr() conversion.
 */
static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
  unsigned cmd, unsigned long arg, void __user *argp)
{
  unsigned int max_sectors;

  switch (cmd) {
    // [...]
    case BLKSSZGET: /* get block device logical block size */
      return put_int(argp, bdev_logical_block_size(bdev));
    // [...]
  }
}

And that leads to:

// in include/linux/blkdev.h
static inline unsigned queue_logical_block_size(const struct request_queue *q)
{
  int retval = 512;

  if (q && q->limits.logical_block_size)
    retval = q->limits.logical_block_size;

  return retval;
}

static inline unsigned int bdev_logical_block_size(struct block_device *bdev)
{
  return queue_logical_block_size(bdev_get_queue(bdev));
}

There’s a hardcoded 512 again. Does anyone set limits.logical_block_size?

// in block/blk-settings.c

void blk_set_default_limits(struct queue_limits *lim)
{
  // [...]
  lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;

  // [...]
}

// [...]

void blk_queue_logical_block_size(struct request_queue *q, unsigned int size)
{
  q->limits.logical_block_size = size;

  if (q->limits.physical_block_size < size)
    q->limits.physical_block_size = size;

  if (q->limits.io_min < q->limits.physical_block_size)
    q->limits.io_min = q->limits.physical_block_size;
}

// [...]

int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, sector_t start)
{
  // [...]
  t->logical_block_size = max(t->logical_block_size, b->logical_block_size);
  // [...]
}

Another default, a setter, and something about “stacked drivers like MD and DM”. The NVME driver has a few calls to the setter:

hugopeixoto@laptop:~/w/c/l/d/nvme$ ack blk_queue_logical_block_size
host/multipath.c
383:    blk_queue_logical_block_size(q, 512);

host/core.c
1843:   blk_queue_logical_block_size(disk->queue, bs);
3594:   blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);

Let’s see what’s up in core.c.

static void nvme_update_disk_info(struct gendisk *disk, struct nvme_ns *ns, struct nvme_id_ns *id)
{
  // [...]
  unsigned short bs = 1 << ns->lba_shift;
  // [...]
  if (ns->lba_shift > PAGE_SHIFT) {
    /* unsupported block size, set capacity to 0 later */
    bs = (1 << 9);
  }
  // [...]
  blk_queue_logical_block_size(disk->queue, bs);
  // [...]
}


// [...]

static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
  // [...]
  ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */

  blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
  // [...]
}

So, apart from another pair of 512s, it may be the result of reading from ns->lba_shift. What’s LBA?

https://en.wikipedia.org/wiki/Logical_block_addressing

Logical block addressing (LBA) is a common scheme used for specifying the location of blocks of data stored on computer storage devices, generally secondary storage systems such as hard disk drives.

ns->lba_shift seems to be nvme specific, because ns is struct nvme_ns *, so I searched only in drivers/nvme. Got this:

static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
{
  struct nvme_ns *ns = disk->private_data;

  /*
   * If identify namespace failed, use default 512 byte block size so
   * block layer can use before failing read/write for 0 capacity.
   */
  ns->lba_shift = id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ds;
  if (ns->lba_shift == 0)
    ns->lba_shift = 9;

There are two setters of lbaf:

// target/admin-cmd.c
  id->nlbaf = 0;
  id->lbaf[0].ds = ns->blksize_shift;

// host/lightnvm.c
nvme_nvm_set_addr_20(&geo->addrf, &id->lbaf);

lightnvm is a different driver, so I’m ignoring that. target/admin-cmd seems to refer to Admin commands. NVMe has two kinds of command queues:

https://metebalci.com/blog/a-quick-tour-of-nvm-express-nvme/

Admin Commands are sent to Admin Submission and Completion Queue (there is only one of this pair with identifier=0).

I/O Commands (called NVM Commands) are sent to I/O Submission and Completion Queues. I/O Command Queues has to be explicitly managed (created / deleted etc.)

So, where does ns->blksize_shift come from?

hugopeixoto@laptop:~/w/c/l/d/nvme$ ack 'blksize_shift ='
target/io-cmd-bdev.c
66:     ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));

target/io-cmd-file.c
56:     ns->blksize_shift = min_t(u8,
78:     ns->blksize_shift = 0;

I’m guessing that I care about io-cmd-bdev, since this is not related to a specific file, but to the block device.

Wait.

bdev_logical_block_size.

I’ve seen that before.

I may have missed something. Apart from a bunch of 512 escape hatches, I don’t think this fetches any value from anywhere. I guess it could be manually set, or somehow overridden? I’m not sure.

I’ll move on to stat and come back to this later.

stat: block sizes and fundamental block sizes

root@laptop:/# strace -s 1024 -v stat -f /dev/nvme0 2>&1 >/dev/null | tail -n2
statfs("/dev/nvme0", {
  f_type=TMPFS_MAGIC,
  f_bsize=4096,
  f_blocks=996251,
  f_bfree=996251,
  f_bavail=996251,
  f_files=996251,
  f_ffree=995765,
  f_fsid={val=[0,
  0]},
  f_namelen=255,
  f_frsize=4096,
  f_flags=ST_VALID|ST_NOSUID|ST_NOEXEC|ST_RELATIME
}) = 0
write(1,
  "  File: \"/dev/nvme0\"\n"
  "    ID: 0        Namelen: 255     Type: tmpfs\n"
  "Block size: 4096       Fundamental block size: 4096\n"
  "Blocks: Total: 996251     Free: 996251     Available: 996251\n"
  "Inodes: Total: 996251     Free: 995765\n",
  219) = 219

This seems to match the information I obtained with debugfs earlier:

root@laptop:/# debugfs -R 'stats' /dev/mapper/laptop--vg-root | grep size | head -n9
debugfs 1.45.6 (20-Mar-2020)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Block size:               4096
Fragment size:            4096
Flex block group size:    16
Inode size:               256
Required extra isize:     28
Desired extra isize:      28

Looking at the source of tune2fs, which prints basically the same info as the debugfs stats command, we have:

  fprintf(f,   "Block size:               %u\n", EXT2_BLOCK_SIZE(sb));
  if  (ext2fs_has_feature_bigalloc(sb))
    fprintf(f, "Cluster size:             %u\n",
      EXT2_CLUSTER_SIZE(sb));
  else
    fprintf(f, "Fragment size:            %u\n",
      EXT2_CLUSTER_SIZE(sb));

And according to the ext4 kernel wiki:

https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Bigalloc

At the moment, the default size of a block is 4KiB, which is a commonly supported page size on most MMU-capable hardware. This is fortunate, as ext4 code is not prepared to handle the case where the block size exceeds the page size. However, for a filesystem of mostly huge files, it is desirable to be able to allocate disk blocks in units of multiple blocks to reduce both fragmentation and metadata overhead. The bigalloc feature provides exactly this ability. The administrator can set a block cluster size at mkfs time (which is stored in the s_log_cluster_size field in the superblock); from then on, the block bitmaps track clusters, not individual blocks.

From the Blocks section:

You may experience mounting problems if block size is greater than page size (i.e. 64KiB blocks on a i386 which only has 4KiB memory pages).

So this explains the difference between blocks and fragments. Ext4 doesn’t support fragments, but it supports clusters, which sound like the opposite, in a way.

fdisk attempt number two

hugopeixoto@laptop:~$ sudo fdisk -l /dev/nvme0n1p3
Disk /dev/nvme0n1p3: 237.75 GiB, 255266390016 bytes, 498567168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

This is, in retrospect, completely unrelated to whatever du is doing. But let’s investigate a bit anyway.

OK, so when tools are writing files, especially if it’s a lot of data, they need to decide how much to write in each syscall. dd, for example, lets you specify obs (output block size). Here’s a blog post on this:

http://blog.tdg5.com/tuning-dd-block-size/

Sidetrack: I ended up doing a pull request to fix a small markup bug, yay static sites! https://github.com/tdg5/blog/pull/8

I would have assumed it defaults to the optimal I/O size; apparently it does default to 512, but I think that’s hardcoded separately, without bothering to do the BLKIOOPT ioctl.

Since fdisk is reporting 512 here, and I’m pretty sure doing a dd obs=64K would be faster, maybe dd doesn’t bother because the right value is not being reported anyway.

I’ll check my other disks:

root@desktop:~# for disk in /sys/block/{sd*,nvme*}; do
> echo "$disk $(cat "$disk"/queue/{logical_block_size,physical_block_size,minimum_io_size,optimal_io_size} |
>               tr $'\n' ' ')";
> done |
> column -te
/sys/block/sda      512  4096  4096  0
/sys/block/sdb      512  4096  4096  0
/sys/block/sdc      512  4096  4096  0
/sys/block/sdd      512  4096  4096  0
/sys/block/nvme0n1  512  512   512   512

So maybe the SSD is indeed faster with 512 blocks? Let me just disprove that.

hugopeixoto@desktop:~$ X=512; dd if=/dev/zero of=zero.img "bs=$X" count=$(( 1024 * 1024 * 1024 / "$X" ))
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.11537 s, 345 MB/s
hugopeixoto@desktop:~$ rm zero.img
hugopeixoto@desktop:~$ X=4096; dd if=/dev/zero of=zero.img "bs=$X" count=$(( 1024 * 1024 * 1024 / "$X" ))
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.658087 s, 1.6 GB/s
hugopeixoto@desktop:~$ rm zero.img
hugopeixoto@desktop:~$ X=65536; dd if=/dev/zero of=zero.img "bs=$X" count=$(( 1024 * 1024 * 1024 / "$X" ))
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.497855 s, 2.2 GB/s

I tried this a couple of times, because sometimes things get cached (and I should have used sync), and 512 was always way slower than 4096 or 65536. From 64K up, it stays at 2.2 GB/s, until it starts to go down again around bs=8M.

So why is the SSD reporting 512? Maybe it’s to be compatible with older systems, who knows. The SATA ones report 0, and fdisk defaults to 4096.

https://superuser.com/questions/451883/can-i-change-my-ssd-sector-size

The 512B sector size reported by the SSD is only for compatibility purposes. Internally data is stored on 8kiB+ NAND pages. The SSD controller keeps track of the mapping from 512B to pages internally in the FTL (Flash Translation Layer).

The “Sector Size” is a fake number reported by the SSD controller so that legacy SATA controllers will play nicely with your SSD.

OK, so it seems that this is it. I don’t think there’s any way to get the real value for this.

Since this is read by fdisk, I assume it’s important information to create disk partitions, to make sure they’re aligned. Sounds like an adventure for another day.

Summary

du reports both apparent (byte) size and real (block) size. It takes the filesystem’s reported block count and block size and normalizes it to 512 byte blocks. Then, to decide what to output, it converts it again to a block size specified by the user, using a command line argument, the DU_BLOCK_SIZE env variable, or even POSIXLY_CORRECT.

jmtpfs does not support reporting blocks, because MTP hides fs details by design. It always sets st_blocks=0, which makes du unable to report disk usage in blocks.

ext4 uses block sizes of 4096. The block size can be set at mkfs time, but it shouldn’t exceed the system’s page size. It doesn’t support fragments, but it supports bigalloc mode (clusters), which is kind of the same thing in the opposite direction: fragments reduce block waste when you have many small files, while clusters reduce metadata overhead when you have many big files. It also supports sparse files, and has a block mapping optimization called extents.

fdisk: SSDs report block sizes of 512, to be compatible with legacy controllers. This information may also be used by disk partitioners, although I didn’t explore that.

smartphone: I still need to back up the contents.