According to the WSJT system requirements, the time must be synchronized within ±1 second.
It's generally the case that the more that is known about a signal, the easier it is to detect.
In the case of WSJT, the software records a swath of spectrum roughly 4 kHz wide, and then must find and decode all the signals in it. It must work this way because the transmissions aren't really amenable to interactive tuning by a human operator the way CW or SSB are. Furthermore, being able to simultaneously monitor dozens of transmissions and pick the most exciting is a big part of the mode's appeal.
WSJT transmissions follow a limited format and are of a fixed length. Decoding them successfully requires identifying the start and the end of each message. The transmissions can also occur anywhere within a range of possible frequencies in the receiver passband.
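With the start time fixed, the search for signals involves only one dimension: frequency. If signals could start at any time, there would be two dimensions, frequency and time, and the decoder would have to test every candidate start time at every candidate frequency. The amount of searching is thus multiplied by the number of possible start times, which is roughly a squaring when the two dimensions are similar in size.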
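To put rough numbers on this, here is a small sketch that counts how many (frequency, start time) hypotheses a brute-force search would have to test. The step sizes, the 60 second period, and the function name `count_candidates` are all assumptions picked for illustration; they are not WSJT's actual parameters or search strategy.

```python
def count_candidates(bandwidth_hz, freq_step_hz, time_window_s, time_step_s):
    """Count the (frequency, start-time) hypotheses in a brute-force grid search."""
    n_freq = int(bandwidth_hz / freq_step_hz) + 1
    n_time = int(round(time_window_s / time_step_s)) + 1
    return n_freq * n_time

# Illustrative values only (assumptions, not taken from WSJT).
BANDWIDTH_HZ = 4000.0   # width of the recorded passband
FREQ_STEP_HZ = 5.0      # hypothetical frequency search step
TIME_STEP_S = 0.05      # hypothetical time search step

# Start time known exactly: search frequency only.
known_start = count_candidates(BANDWIDTH_HZ, FREQ_STEP_HZ, 0.0, TIME_STEP_S)
# Clock within +/-1 s: a 2 s window of possible start times.
clock_synced = count_candidates(BANDWIDTH_HZ, FREQ_STEP_HZ, 2.0, TIME_STEP_S)
# No synchronization: start could be anywhere in a hypothetical 60 s period.
unsynced = count_candidates(BANDWIDTH_HZ, FREQ_STEP_HZ, 60.0, TIME_STEP_S)

print(known_start, clock_synced, unsynced)
```

With these made-up step sizes, the unsynchronized case needs about 29 times as many hypotheses as the ±1 second case, and over a thousand times as many as a perfectly known start.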
Of course the time isn't exactly synchronized, so there's still a little time synchronization that needs to happen. However, knowing the start of the message within ±1 second is better than knowing nothing at all about the time.
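The residual alignment can be pictured as a short correlation search around the nominal start time. The sketch below is a toy in the sample domain, not WSJT's decoder: it uses a 13-element Barker code as a stand-in sync pattern (an assumption, chosen only because its correlation peak is sharp), and the names `best_lag`, `NOMINAL_START`, and `TRUE_OFFSET` are hypothetical.

```python
import random

def best_lag(recording, pattern, nominal_start, max_lag):
    """Return the lag in [-max_lag, max_lag] that maximizes correlation with pattern."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        start = nominal_start + lag
        if start < 0 or start + len(pattern) > len(recording):
            continue  # this lag would run off the end of the recording
        window = recording[start:start + len(pattern)]
        score = sum(r * p for r, p in zip(window, pattern))
        if score > best_score:
            best, best_score = lag, score
    return best

# Stand-in sync pattern: Barker-13 (assumption, not what WSJT transmits).
PATTERN = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
NOMINAL_START = 40   # where the message would start with a perfect clock
TRUE_OFFSET = 7      # actual clock error, in samples, unknown to the receiver

rng = random.Random(0)
# Small bounded noise floor, then the pattern added at its true position.
recording = [rng.uniform(-0.1, 0.1) for _ in range(120)]
for i, p in enumerate(PATTERN):
    recording[NOMINAL_START + TRUE_OFFSET + i] += p

# Only lags within the synchronization tolerance are searched.
found = best_lag(recording, PATTERN, NOMINAL_START, max_lag=10)
print(found)
```

Because the clock tolerance bounds `max_lag`, the loop covers only 21 lags here instead of every possible position in the recording.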