Highlighting synthesized speech misses words

Hi, I'm using one of the AVSpeechSynthesizer delegate methods to determine when words are about to be spoken, and then applying a background color to highlight them:

Code Block
func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, willSpeakRangeOfSpeechString characterRange: NSRange, utterance: AVSpeechUtterance) {
    // Convert the NSRange from the callback into a String range so the word can be logged.
    guard let rangeInString = Range(characterRange, in: utterance.speechString) else { return }
    print("Will speak: \(utterance.speechString[rangeInString])")
    // Rebuild the attributed string with the base attributes...
    let attributes: [NSAttributedString.Key: Any] = [
        .foregroundColor: UIColor.darkText,
        .font: UIFont.systemFont(ofSize: 16)
    ]
    let mutableAttributedString = NSMutableAttributedString(string: utterance.speechString, attributes: attributes)
    // ...then highlight only the range that is about to be spoken.
    mutableAttributedString.addAttribute(.backgroundColor, value: UIColor.yellow, range: characterRange)
    jpTextView.attributedText = mutableAttributedString
}

Based on the debug log, I can see that it is reading every word in my UITextView, but the yellow background highlighting is quite inconsistent and appears to skip around half the words. Is this an inherent issue with the language I'm using (Japanese), or is there something fundamentally wrong in my code?
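
For reference, the synthesizer is set up roughly like this (a simplified sketch; the class name and the ja-JP voice selection are illustrative, and the text view wiring is omitted):

Code Block
import AVFoundation

final class SpeechController: NSObject, AVSpeechSynthesizerDelegate {
    let synthesizer = AVSpeechSynthesizer()

    override init() {
        super.init()
        // Delegate must be set so willSpeakRangeOfSpeechString fires.
        synthesizer.delegate = self
    }

    func speak(_ text: String) {
        let utterance = AVSpeechUtterance(string: text)
        // Assumption: a Japanese voice to match the text being spoken.
        utterance.voice = AVSpeechSynthesisVoice(language: "ja-JP")
        synthesizer.speak(utterance)
    }
}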

Replies

Can you run your code and print out timestamps along with the ranges of the text? Do the callbacks you receive match the text as you hear it? In other words, do the ranges seem correct, or do they come in late or two at a time, and is that why they appear to skip? An easy way to check is to paste the same text into Notes and speak it with Speak Selection, which can be found in Settings > Accessibility > Spoken Content.

If the ranges are skipped using Speak Selection too, and the timing of the callbacks or the accuracy of the ranges seems incorrect, please file a bug with the text you are trying to speak and the voice you are using, and paste the Feedback ID here if you can.

If the ranges appear correct and the text is properly highlighted with Speak Selection, I'm not entirely sure what the issue is. Perhaps someone more familiar with UITextView can chime in.
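
OK, I added a timestamped print to the callback. This is roughly the logging I used (a minimal sketch; the date format string is an assumption chosen to match the output below):

Code Block
// Assumed formatter producing timestamps like "2020-07-23 23:31:15.9130".
let timestampFormatter: DateFormatter = {
    let formatter = DateFormatter()
    formatter.dateFormat = "yyyy-MM-dd HH:mm:ss.SSSS"
    return formatter
}()

func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, willSpeakRangeOfSpeechString characterRange: NSRange, utterance: AVSpeechUtterance) {
    guard let rangeInString = Range(characterRange, in: utterance.speechString) else { return }
    // Log each callback with a timestamp so the gaps between words are visible.
    print("\(timestampFormatter.string(from: Date())): \(utterance.speechString[rangeInString])")
    // ... highlighting code as before ...
}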
Using this sample text:
国内と同時放送するニュースの発信を強化し、最新の動きを詳しく伝えます。内外で頻発する自然災害や、大きな事件・事故などの際には、機動的にニュースを編成して的確に情報を発信し、日本語ライフラインとしての役割を果たします。

(Roughly: "We will strengthen the delivery of news broadcast simultaneously with domestic channels, covering the latest developments in detail. When natural disasters or major incidents and accidents occur at home or abroad, we will flexibly assemble news programming to deliver accurate information, fulfilling our role as a Japanese-language lifeline.")

The output with timestamps is:
2020-07-23 23:31:15.9130: 放送
2020-07-23 23:31:15.9140: する
2020-07-23 23:31:15.9150: ニュース
2020-07-23 23:31:16.4810: の
2020-07-23 23:31:16.4820: 発信
2020-07-23 23:31:17.0840: を
2020-07-23 23:31:17.0860: 強化
2020-07-23 23:31:18.0080: し
2020-07-23 23:31:18.0100: 、
2020-07-23 23:31:18.0110: 最新
2020-07-23 23:31:18.7510: の
2020-07-23 23:31:18.7520: 動き
2020-07-23 23:31:19.2750: を
2020-07-23 23:31:19.2770: 詳し
2020-07-23 23:31:19.8240: く
2020-07-23 23:31:19.8250: 伝え
2020-07-23 23:31:20.5670: ます
2020-07-23 23:31:20.5680: 。
2020-07-23 23:31:20.7660: 内外
2020-07-23 23:31:21.5350: で
2020-07-23 23:31:21.5370: 頻発
2020-07-23 23:31:22.3780: する
2020-07-23 23:31:22.3800: 自然
2020-07-23 23:31:23.7600: 災害
2020-07-23 23:31:23.7620: や
2020-07-23 23:31:23.7620: 、
2020-07-23 23:31:23.7640: 大きな
2020-07-23 23:31:24.2730: 事件
2020-07-23 23:31:25.4670: ・
2020-07-23 23:31:25.4680: 事故
2020-07-23 23:31:25.4690: などの
2020-07-23 23:31:25.4690: 際
2020-07-23 23:31:26.2600: には
2020-07-23 23:31:26.2610: 、
2020-07-23 23:31:26.2620: 機動
2020-07-23 23:31:27.3160: 的
2020-07-23 23:31:27.3180: に
2020-07-23 23:31:27.3190: ニュース
2020-07-23 23:31:27.8350: を
2020-07-23 23:31:27.8370: 編成
2020-07-23 23:31:28.7740: し
2020-07-23 23:31:28.7750: て
2020-07-23 23:31:28.7760: 的確
2020-07-23 23:31:29.5270: に
2020-07-23 23:31:29.5290: 情報
2020-07-23 23:31:30.1660: を
2020-07-23 23:31:30.1670: 発信
2020-07-23 23:31:31.1030: し
2020-07-23 23:31:31.1040: 、
2020-07-23 23:31:31.1050: 日本語
2020-07-23 23:31:32.5260: ライフライン
2020-07-23 23:31:32.5280: と
2020-07-23 23:31:32.5290: し
2020-07-23 23:31:32.9200: ての
2020-07-23 23:31:32.9210: 役割
2020-07-23 23:31:33.6030: を
2020-07-23 23:31:33.6040: 果た
2020-07-23 23:31:34.3960: し
2020-07-23 23:31:34.3980: ます
2020-07-23 23:31:34.3980: 。

So it's not skipping any words, but as you suggested in your first paragraph, there are groups of words whose callbacks arrive almost simultaneously, and the highlight moves to the next word so fast that I can't see it. For example, し, 、, and 日本語 all fire within 2 ms of each other at 23:31:31.10x.

It seems unnatural, though: the audio reads the words at a fairly steady pace, but willSpeakRangeOfSpeechString groups the callbacks together, so their pacing is much less steady.

Is there a way to improve this, or is it a current limitation of the framework?
Unfortunately there isn't anything you'll be able to do to work around this problem. The timestamps are generated along with the synthesized speech and it sounds like there is an error in some logic there. I'll forward this info along to the relevant folks.
Thanks for that. Please keep us informed if there are any future updates.
I just thought about it a bit more and realised there is probably a quick workaround on my side: if the timestamps between consecutive callbacks are below a certain threshold, merge their ranges and highlight them as one larger block, rather than highlighting short words that are spoken and passed over too quickly to see.
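
A minimal sketch of that grouping idea (the GroupedHighlighter name and the 50 ms threshold are illustrative, not from any API):

Code Block
import Foundation

// Hypothetical helper: merges ranges whose callbacks arrive within
// `threshold` seconds of each other, so near-simultaneous words are
// highlighted as one block.
final class GroupedHighlighter {
    private var lastCallbackTime: TimeInterval = 0
    private var groupedRange: NSRange?
    private let threshold: TimeInterval = 0.05  // arbitrary 50 ms grouping window

    /// Returns the range to highlight for this callback: either the new
    /// range on its own, or the union of recent near-simultaneous ranges.
    func rangeToHighlight(for characterRange: NSRange) -> NSRange {
        let now = Date().timeIntervalSinceReferenceDate
        defer { lastCallbackTime = now }
        if let current = groupedRange, now - lastCallbackTime < threshold {
            // Extend the current highlight to also cover the new range.
            let merged = NSUnionRange(current, characterRange)
            groupedRange = merged
            return merged
        }
        groupedRange = characterRange
        return characterRange
    }
}

In the delegate callback, the highlight would then be applied over rangeToHighlight(for: characterRange) instead of characterRange directly.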