Monday, November 10, 2008

PowerPoint spell check and text substitution problem

A problem can occur when identifying substitution text in a presentation. In my example the token text is wrapped in start and end markers (e.g. [@token]). If the token is a word in the Office dictionary there is no problem. If the token word isn't in the dictionary and gets the red underline from spell check, however, PowerPoint breaks the string into multiple text runs. It does this because the flagged text carries a RunProperties setting that marks it as a spelling error, and the schema has to represent that as a separate run. The following shows the issue.


Slide to be parsed


Here's the breakdown of the text runs
Text Before: Token fun
Text Before: Sometimes a [@Token] string will get broken up.
Text Before: When spell check flags a misspelled word the
Text Before: [@
Text Before: MisspelledToken
Text Before: ] gets split between multiple text blocks
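A listing like the one above can be produced with a few lines against the Open XML SDK. This is a minimal sketch and not necessarily the code used for this post: it assumes the DocumentFormat.OpenXml package is referenced, and the RunDump class and ListTextRuns method are names made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml;
using Drawing = DocumentFormat.OpenXml.Drawing;

static class RunDump
{
    // Hypothetical helper: returns the text of every a:t element under the
    // given root, in document order. Each entry corresponds to one text run.
    public static List<string> ListTextRuns(OpenXmlElement root)
    {
        return root
            .Descendants<Drawing.Text>()
            .Select(t => t.Text)
            .ToList();
    }
}
```

Calling it as `foreach (string s in RunDump.ListTextRuns(slidePart.Slide)) Console.WriteLine("Text Before: " + s);` would print one line per text run.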


As you can see, the token spans multiple text runs and thus doesn't get replaced. This can be a major hang-up. A quick solution for this one example relies on a pattern: when a text run ends with the start-token, the next two text runs hold the token itself and then the end-token. So I can append those three together and then replace the token as necessary.


// Requires:
// using System.Collections.Generic;
// using System.Linq;
// using DocumentFormat.OpenXml.Packaging;
// using Drawing = DocumentFormat.OpenXml.Drawing;
void SubstituteText(SlidePart slidePart, string tokenStart, string tokenEnd, string tokenValue, string subValue)
{
    string token = tokenStart + tokenValue + tokenEnd;

    //Collect all the Paragraph sections containing the token
    List<Drawing.Paragraph> paragraphList = slidePart.Slide
        .Descendants<Drawing.Paragraph>()
        .Where(p => p.InnerText.Contains(token)).ToList();

    foreach (Drawing.Paragraph p in paragraphList)
    {
        //Collect all the Text
        List<Drawing.Text> textList = p
            .Descendants<Drawing.Text>()
            .ToList();

        //Iterate and find tokenStart Text block or replace text if whole token found
        for (int i = 0; i < textList.Count; i++)
        {
            Drawing.Text t = textList[i];

            //only merge when two more text segments actually follow
            if (t.Text.EndsWith(tokenStart) && i + 2 < textList.Count)
            {
                //append next two text segments and remove them
                for (int n = 0; n < 2; n++)
                {
                    Drawing.Text appendText = textList[i + 1];
                    t.Text = t.Text + appendText.Text;
                    //must remove at the Drawing.Run level
                    appendText.Parent.Remove();
                    //after removal the following segment shifts down to index i + 1
                    textList.RemoveAt(i + 1);
                }
            }
            //substitute text
            t.Text = t.Text.Replace(token, subValue);
        }
    }
}


After running this I get the following output as desired.

.pptx after substitution


Breakdown of the text runs after substitution
Text After: Token fun
Text After: Sometimes a TOKEN string will get broken up.
Text After: When spell check flags a misspelled word the
Text After: BROKEN TOKEN gets split between multiple text blocks

This doesn't cure every case. After going back and editing existing text, I have also seen the start-token and token stay together in one run while the end-token is separated into another. We also can't just take all the runs of a paragraph, smash them together into a single run, and then substitute, because that would throw away any text styling applied within the paragraph. I have tried parsing a paragraph by comparing the RunProperties of consecutive runs to decide whether it's valid to append them together. I am still working on this and will present the final solution at a later time.
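The RunProperties comparison described above might look something like the following. This is only a rough sketch under stated assumptions, not the final solution: it assumes the DocumentFormat.OpenXml package, treats two runs as mergeable when their serialized run properties match once the spell-check flags (the err and dirty attributes) are ignored, and uses the made-up names RunMerger, FormattingKey, and MergeAdjacentRuns.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using Drawing = DocumentFormat.OpenXml.Drawing;

static class RunMerger
{
    // Hypothetical helper: a string that identifies a run's formatting,
    // ignoring the proofing flags that spell check adds.
    static string FormattingKey(Drawing.Run run)
    {
        Drawing.RunProperties props = run.RunProperties;
        if (props == null) return string.Empty;

        // Work on a copy so the document itself is not modified here.
        var copy = (Drawing.RunProperties)props.CloneNode(true);
        copy.SpellingError = null; // err attribute
        copy.Dirty = null;         // dirty attribute

        // An rPr with nothing left in it is treated like a missing rPr.
        if (!copy.HasAttributes && !copy.HasChildren) return string.Empty;
        return copy.OuterXml;
    }

    // Appends each run to the run before it when both have the same
    // formatting key, so a token split by spell check lands in one run.
    public static void MergeAdjacentRuns(Drawing.Paragraph paragraph)
    {
        Drawing.Run previous = null;
        foreach (OpenXmlElement child in paragraph.ChildElements.ToList())
        {
            if (child is Drawing.Run run && run.Text != null)
            {
                if (previous != null && FormattingKey(previous) == FormattingKey(run))
                {
                    previous.Text.Text += run.Text.Text;
                    run.Remove();
                    continue;
                }
                previous = run;
            }
            else
            {
                // Line breaks, fields, and paragraph properties end a mergeable sequence.
                previous = null;
            }
        }
    }
}
```

Running MergeAdjacentRuns on each paragraph before the token replacement would leave differently styled runs separate. Note that the merged run keeps the properties of the first run in the sequence, including any spell-check flag it carried.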

Hopefully this post has revealed some of the nuances of text substitution and shown that it isn't always straightforward.

1 comment:

Unknown said...

PowerPoint may also split the token if the user clicks in the middle of the word, and on some other events you can't control. I had to deal with that at work. I used a normaliser approach where I renormalize every run so that it contains either exactly the whole token or no part of a token. If you care about formatting you have to copy the rPr tag over to the new runs. Take care of the endPara properties as well. I ended up with a relatively complex finite state machine to do it. But I had to support round-trip editing (ppt -> our app -> ppt -> ...), so you might get away with a simpler version.