Fixing a bug in Rocket Chat
Published on October 29, 2020
I’m a member of D3, a portuguese association focused on defending digital rights. We use Rocket Chat as our main communication tool (think Slack, but open source). Recently, Rocket.Chat announced that they’re limiting the number of monthly mobile push notifications for self hosted instances. Their new limit is 5000 notifications per month, and we’re way over that number. September was a particularly active month for us, thanks to the Stayaway Covid mess, and we almost hit 20k notifications.
Side note: with Rocket Chat being open source software, the push notifications limit may seem strange. This is because the official Android and iOS applications are published by Rocket.Chat the company, and only they can send push notifications through APNS/FCM. If self-hosters want to reach users of the official apps, they need to use a gateway provided by the company. If we wanted to self-host the push notification server to avoid depending on their gateway, we’d have to publish our own mobile applications. If you want more info on this, read Thomas’s blog post or check the Open Push project.
Rocket.Chat announced some paid plans to raise the limit, but before considering them, we decided to upgrade our instance (we were a bit behind) to see if the latest version is smarter when deciding whether to send a push notification. There was a redesign of this logic in 3.2.0, but I’m not sure if this will be enough to keep us under 5000 per month.
Meanwhile, after the upgrade, we were checking if everything was working fine
and we found a bug in the administration panel, in the federation dashboard.
Accessing the dashboard causes the frontend to freeze and the backend logs to
be filled with Error, too many requests
exceptions. The logs pointed to a
problem in federation:getOverviewData
and federation:getServers
calls. A
quick search on github led me to this issue:
https://github.com/RocketChat/Rocket.Chat/issues/19007. It seems that we’re
not the only ones to experience this problem. The comments were mostly “me too”
and stack traces, so it seemed like no one had started to debug the problem. I
could have just posted a comment saying we were experiencing the same issue,
but I decided to take a look at their codebase to try to add some more
information to the report.
I had already git cloned https://github.com/RocketChat/Rocket.Chat and looked around a bit. It is a monorepo meteor application. I don’t know much about meteor, but being a monorepo is great for code spelunking, as I didn’t have to jump around between repositories.
From the behavior we saw, I expected that the bug would be in the frontend.
Something like a misconfigured retry, perhaps. I started by searching for
federation:getOverviewData
. This looks like a meteor method call, so I was
expecting matches in both the server and the client:
1
2
3
$ ack -l federation:getOverviewData
app/federation/server/methods/dashboard.js
client/admin/federationDashboard/OverviewSection.js
The client/../OverviewSection.ts
file contains a react functional component,
OverviewSection
. It’s a small file, and here’s the part that’s relevant to
this bug hunt, reformatted for clarity:
1
2
3
4
5
6
7
8
9
10
11
import { usePolledMethodData } from '../../contexts/ServerContext';
function OverviewSection() {
const [overviewData, overviewStatus] = usePolledMethodData(
'federation:getOverviewData',
[],
10000,
);
// bunch of ifs and presentational components
}
usePolledMethodData
and its arguments are key here. This is using the use*
prefix, a pattern usually reserved for React hooks. Following
the trail of usePolledMethodData
, here is the relevant code from
ServerContext.ts
(adapted from the 3.7.1 release):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
import { createContext, useCallback, useContext, useState, useEffect } from 'react';
export const ServerContext = createContext({
callMethod: async () => undefined,
// bunch of other stuff
});
export const useMethod = (methodName: string) => {
const { callMethod } = useContext(ServerContext);
return useCallback(
(...args) => callMethod(methodName, ...args),
[callMethod, methodName]
);
};
export const useMethodData = (methodName: string, args = []) => {
const getData = useMethod(methodName);
const [[data, state], updateState] = useState([
undefined,
AsyncState.LOADING
]);
const fetchData = useCallback(() => {
updateState(([data]) => [data, AsyncState.LOADING]);
getData(...args).then((data) => {
updateState([data, AsyncState.DONE]);
}).catch((error) => {
updateState(([data]) => [data, AsyncState.ERROR]);
});
}, [getData, args]);
useEffect(() => {
fetchData();
}, [fetchData]);
return [data, state, fetchData];
};
export const usePolledMethodData = (methodName, args = [], intervalMs) => {
const [data, state, fetchData] = useMethodData(methodName, args);
useEffect(() => {
const timer = setInterval(() => { fetchData(); }, intervalMs);
return () => { clearInterval(timer); };
}, [fetchData, intervalMs]);
return [data, state, fetchData];
};
There are a few things that stand out. It’s using setInterval
, so maybe the
bug could be from setting the wrong number of milliseconds. There are many
useEffect
s and useCallback
s, so the bug could be in their dependencies. For
example, if fetchData
changes in every frame, a new timer is created in every
tick (with the previous one being cancelled). There doesn’t seem to be any
retry code. At first glance I didn’t notice anything wrong with setInterval
’s
parameters, so I was leaning towards the hook dependency hypothesis.
This was everything that I could find without running the code, and I felt like it was enough information to provide some value to the bug report, so I posted this suggestion in the github issue. I could’ve stopped here, but I decided to try to find the actual issue and fix it.
This is a meteor app, and their installation instructions is based on a curl |
sh
command, so I decided to write a Dockerfile to avoid installing it in my
system:
1
2
3
4
5
6
7
8
FROM debian
RUN apt update
RUN apt -y install curl procps git vim
RUN curl https://install.meteor.com/ | sh
WORKDIR /app
ENV METEOR_ALLOW_SUPERUSER=true
RUN meteor npm install
1
2
$ docker build . -t rocket-chat
$ docker run --rm -it -v $PWD:/app -p3000:3000 rocket-chat:latest
Once inside the container, I can run the app with meteor npm start
, and it
handles everything for me, including launching a mongodb server. Accessing
http://localhost:3000
lets me set up the Rocket Chat account and I’m in. This
whole process was way easier than what I was expecting. When I was fixing a
bug in forem, running things wasn’t this easy. Now that I had
Rocket Chat running, I could access the federation dashboard and reproduce the
issue.
Usually I would try to reproduce this in a test, but this codebase doesn’t seem to have any tests for their hooks. It would take me more time that what I wanted to spend to learn how to test this properly, so I ended up changing the code and reloading the browser. This was not ideal, since each reload would almost crash my browser unless I was fast enough to hit the “pause” button on the debugger before everything froze, but fortunately I only had to forcefully stop firefox once.
After some iterations of adding console logs and removing useEffect
dependencies to be sure, I tracked down the problem to the useEffect
function
in useMethodData
:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
import { createContext, useCallback, useContext, useEffect } from 'react';
export const ServerContext = createContext({
callMethod: async () => undefined,
// ...
});
export const useMethod = (methodName: string) => {
const { callMethod } = useContext(ServerContext);
return useCallback(
/**/,
[callMethod, methodName]
);
};
export const useMethodData = (methodName: string, args = []) => {
const getData = useMethod(methodName);
const fetchData = useCallback(() => {
console.log("calling fetchData");
// ...
}, [getData, args]);
useEffect(() => {
console.log("calling fetchData from useMethodData.useEffect");
fetchData();
}, [fetchData]);
// ...
};
export const usePolledMethodData = (methodName, args = [], intervalMs) => {
/**/ = useMethodData(methodName, args);
// ...
};
function OverviewSection() {
/**/ = usePolledMethodData('federation:getOverviewData', [], 10000);
// ...
}
Both console.log
s were being executed nonstop, without any delay. The
function passed to useEffect
is only executed when a dependency changes, so
this meant that fetchData
was changing. As fetchData is being memoized by
useCallback
and it was changing, then one of its dependencies was also
changing : either getData
or args
. Removing every dependency by replacing
both [getData, args]
and [fetchData]
with []
, stopped the infinite http
request loop.
After some more iterations of adding and removing dependencies, I narrowed down
the problematic dependency to args
. I wasn’t really expecting this; usually
it’s a function that’s not being properly memoized, and I was expecting it to
be getData
. args
was always being set to []
, so I had little reason to
suspect it was the problem.
To understand why args
was causing the bug, I checked what’s used by
useEffect
to check if dependencies change. It uses Object.is
. This is
similar but not exactly the same as ===
(there are some differences when
handling signed zeros and NaN
s). The following snippet
illustrates the problem:
1
2
3
4
5
6
> Object.is([], [])
false
> const args = [];
undefined
> Object.is(args, args)
true
This explains the problem. OverviewSection
is passing a literal []
as
args. Each call to
[] creates a new array, so
OverviewSection will never
pass the same array to
usePolledMethodData, causing
fetchData to be
recomputed every time, causing
useEffect`’s body to run every time as well,
causing the bug.
Now that the problem with the code has been identified, there are many ways to fix it.
We can use [getData, ...args]
instead of [getData, args]
. This will remove
the array from the equation, but may cause additional problems if any of the
arguments is an array or an object.
We can memoize []
in the caller, ensuring that we’re always passing the same
[]
. We can do this by using useMemo(() => [], [])
instead of []
. This has
the problem of not solving the issue for every caller of useMethodData
, so
the bug can reappear the next time someone tries to use this hook or when
existing code gets refactored, as tests don’t currently cover this part of the
codebase.
We can use [getData, JSON.stringify(args)]
. Strings with the same contents
are considered the same when using Object.is
. This has the extra overhead of
having to create the json string, and can cause some issues if try to
serialize cyclic objects, but this will have to be serialized
to be sent to the meteor server (we’re in the context of a method call), so the
cycle problem doesn’t apply here.
The third option seemed the best one, and I started writing the PR
description. One thing I like to include in the PR description is how the
bug appeared in the first place. Even without tests, this had to work at some
point; I don’t think someone would just push broken code from the start. My
guess was that there was some change (maybe a refactor, or a lint fix, or
something) unrelated to the feature itself and this page wasn’t checked. I
checked the git history of both ServerContext.ts
and OverviewSection.js
and
found a change in ServerContext that shows that the original
code was [getData, ...args]
, one of the solutions I mentioned. This PR is
mostly changing type stuff, so the author probably didn’t go through the
Federation dashboard page. At this point I had enough info to make a PR:
https://github.com/RocketChat/Rocket.Chat/pull/19386
Another thing that I should have checked is that I was working on an up to date branch. I noticed that I was working on an outdated commit when I saw the “This branch is out-of-date with the base branch” message on github. I usually clone the repository immediately before working on the bug I’m trying to fix so this isn’t a problem, but I had cloned this repository a month ago to check how hard it would be to implement channel visibility based on roles, and it didn’t register that this had been so long ago, so I didn’t bother to git pull.
When rebasing, I noticed that OverviewSection.js
had been changed… to use
useMemo(() => [], [])
. This change was included in another refactor
PR, which was merged 16 days ago. The latest release was 20 days
ago, so there’s no release that includes this fix yet. There are no references
to this PR in the original issue, so I didn’t expect the issue to already be
fixed. I should have double checked the branch I was working on.
I think that JSON.stringify
on the function is a better solution than
useMemo
on the caller, so I updated my PR to remove the useMemo calls. Right
now, there are 230 pull requests open, so I’m not sure if this will be picked
up or not. At least I built up some extra knowledge on react hooks.