In this article, we continue to explore the powerful text search capabilities of Word for WhatsApp Chat Analysis . In the last article we saw WHERE to search. This time we will see HOW to search.
Contents
What can you search for using Word while doing WhatsApp Chat Analysis ?
Instead of describing each feature in the Find dialog, I am going to show what is possible and I want you to discover the features yourself.
Usually we just go to the Find dialog or Find Pane and type a word or a phrase. That is all. Unfortunately, that is not even like scratching the surface of the capability
The Find dialog looks like this. I am sure all you know it. There is a button called More and there is another one called Cancel. Rarely do we feel like clicking on the MORE button… because our brain reads it like MORE TROUBLE. So you have two choices… Explore or Run away from features. This time we are going to have the courage to explore – because it will be of immense benefit to you.
When you click on More – you really get a lot more! Don’t be inundated by the plethora of features. They are there to help you. Not scare you away.
Search options give you more flexibility while searching. Format button allows you to search by any type of formatting including Styles and Highlights (we have seen this in an earlier article – How to read a document containing highlights?)
Special button provides you to search for items which you cannot type in the find box directly – like paragraph marks, tabs and so on. Remember that the special list is different for Find and Different for Replace. Depending upon where your cursor is, this list changes. Explore it. It provides you with special codes which you can learn once and then use without going to Special list. For example ^p is typed to search for paragraph marks.
The Use Wildcard button is very very special and it changes the Special list completely. I will discuss this separately later in this article.
For now let us focus on the search options. I will not tell you which option does what. I will tell you what is possible and you have to discover the appropriate option. No spoon feeding.
What kind of things can we search for
Of course you can make it sensitive to the upper lower case.
And also check if you want specify a whole word or part of the word. If you search for and, do you also want to find sandy and brandy? That is part of the word.
If you want to find words with similar verbal pronunciation but different spellings you have an option. For example, you search for bear, you can find beer or bare also. This works with multiple languages, depending upon the installed languages.
It is also possible to search for, say run, and find run, ran, running as well.
Sometimes we want to search only in the beginning or ending of the word. For example if you search for and but you also want to find android, androsterone, andaman. Alternatively you may want to find band, sand, brand, stand only.
Things become more interesting when you want to find the word malware but it may have been spelled as mal ware or “mal”ware or mal-ware.
Nice and easy. This is very useful in social media analytics because people use all kinds of spellings and variations. Often the typing is done on cumbersome mobile keyboards – so spelling mistakes, abbreviations and slang is very commonly used.
The wildcards
This part requires at least 20 minutes of focused learning.
These are the really powerful search helpers. These don’t expect you to specify the entire word or phrase. You may want to find any word which starts with a, ends with y and has three characters between a and y.
Or you may want to find a word where the first character can be any single digit number followed by a dot (period).
How does this work?
First of all you have to activate the Wildcard mode.
Now click on the Special button and look at what is available. Don’t worry. At first glance it makes absolutely no sense. But it is actually simple once you understand the concept.
Basically Word is asking you… what do you want to search for.. what are the components of the word which you know about. Unfortunately Word does not understand our language – so we have to speak in a language which Word understands. Little effort required. But the results are amazing.
Only the ones which are shown with white background are wildcard related. Others are just special characters for which you have to insert a simple code – which you don’t know about – so you find out by choosing them from here. For example, if you want to search for Tab – the code is ^t. Once you know the code, you don’t need to worry.
Now let us focus on the wildcard options. This one you probably know from Excel. Question mark means one character and asterisk means any number of characters.
So if you type ?o? then it will search for all three letter words where the middle letter character is letter o (not zero).
Let us try a Find All (Find All = Find In – Main Document)
But something is wrong here … It is doing its job but it is finding it in part of a word as well as across word because ov also fits into what we asked for ?o?
Now we need to inform Word that we only want full words. Not part words. Unfortunately, the option called Find Whole Words Only is inactive when Use Wildcards option is ON. Sounds like we are stuck!
Don’t worry. We have another option there to help us within the Special section. Let me show you again…
So now our search should happen on <?o?>
.. and it works beautifully. Useful or not?
The asterisk * character based search does not behave well. It does not stop at word boundary. So it is confusing. Don’t use it as far as possible.
Repetitive characters can be found using @ character. So sto@p will match stop and stoop
Square brackets can be used to specify a range of characters.
For example, if you specify <[0-9]. followed by a space … (type a space) in a manually typed numbered list it will search for all the numbers. What does it mean? Start with beginning of a word which has a number in the first position followed by a period and then a space.
Notice that it did not find 10. because there are two numbers there. So how do you find 10.? Let us try <[0-9][0-9]. followed by space
Unfortunately, it finds only 10 because we are forcing two digits. We want a method of saying it could be either one or two digits between the range of zero and nine.
That is where curly braces come in. These are used to say how many instances of the previous character can be used. For example if you type fo{1,3}d then it will match fod, food and foood.
Now we can specify what we exactly want to get rid of the manually typed numbers in our list. The search condition, <[0-9]{1,2}. followed by space we will get what we want.
So what will this find? 1[0]{1,4} – it is like saying find a word which starts with 1 and has one, two, three or four zeros after it. So it will find 10, 100, 1000 and 10000.
So far so good. Now let us explore another great use of this multiple instances feature. Often, while analyzing social media content, people just press extra enter keys or dots to space out their content while posting. We can convert multiple Enter keys (which are basically paragraph marks for word) into a single one using a very nice technique.
The special character for paragraph mark is ^p, but it cannot be used with wildcards. So we use another one called ^13
We don’t want to disturb the single paragraph mark which occurs after the first sentence. We only want to find those where more than two paragraph marks appear together. Once we find them, we will replace them with a single paragraph mark.
So here is what we need to search for ^13{2,}
Notice that we have started with 2 instances and left the ending number blank. That means it will take any number of instances starting with 2.
While replacing, we can use the ^p code. So this is what it looks like.
And after replacing it with ^p it becomes like this – which is exactly what we want to clean up long conversations and chats.
This technique can be used to replace multiple tabs with one tab for converting text to table. It can also be considered for multiple periods which often appear in jokes or lengthy paragraphs posted in social media.
Now the next concept is that of a record. Each paragraph is like a single row for Word. When you search, you can identify parts of what you want by enclosing them in brackets.
For example, consider these numbers. These are separate paragraphs.
Let us say we want to reverse what is on either side of the decimal point. How do we do it?
We have two numbers separated by the decimal point. That is the constant part. So any digits on left side, then the decimal point and then any number of digits on the right side.
That is easy with what we already know. Let us try only the left number first and the decimal like this [0-9]{1,}.
You already know what this means. Any digit- any number of times followed by a dot. If we find all right now… this is what you get.
Now if we complete the thing by adding the second set of digits, it looks like this
[0-9]{1,}.[0-9]{1,} Let us try a find all now.
It did find what it what we expected. But this time we don’t just want to find. We want to reverse the numbers around the decimal. That means we have to type something in the Replace textbox. What do we write there? We don’t know the actual numbers because those are different in each case.
That is why, we need a method of saying that thing on the left side of the decimal is one part and the numbers on the right side of the decimal is another part and we want to put part 2 first and then part 1.
We specify these portions by putting brackets around them.
([0-9]{1,}).([0-9]{1) The purple bracket is the first part and and the red brackets identify the second part. The bracket colors are only for explanation purpose.
Now let us try a Find All.
Practically nothing has changed. But internally Word now knows what you mean by part 1 and 2. Now we can tell Word how you want this replaced. So type 2.1 in replace area. This basically means put the part 2 initially, then the decimal point and then the first part.
And this is what we get.
If I had replaced it with 20.1 then a zero would get added to the first number! Remember, this is not Excel. This is Word. For Word, everything is text.
Now about exceptions. The [!abc] is like saying any character except abc
See this example and you will now understand what it does. <??[!x]> means find all three character words which don’t end in x
Notice that fox was ignored. Also notice that all this is case sensitive…
We have progressed a lot now. I am sure you will be able to do various types of useful searches with ease now. Practice more and you get better.
WhatsApp transcript
With this in mind, let us look at whether we can just dump the WhatsApp transcript first in Word and then created a nicely Tab delimited version of it which can be imported easily into Excel. A typical post looks like this. Of course the exact date and time display may differ based upon your mobile settings. But now you have enough knowledge to understand the pattern and tweak your approach accordingly.
Now with this search expression
([0-9]{1,2}:[0-9]{1,2})([ap]m), ([0-9]{1,2} ???) – (*):(*)^13
and this replace expression
1 2^t2^t3^t4^p
What happened? The data is now delimited by tabs into four columns: Time, Date, Name and the chat text.
How did this work?
9:22pm, 23 Mar – Person1: How are you
([0-9]{1,2}:[0-9]{1,2})([ap]m), ([0-9]{1,2} ???) – (*):(*)^13
This is the first component which looks at one or two digits followed by a : followed by one or two digits. This is the time portion.
The next one is looking for either am or pm. This is the time unit component.
In this case it is 9:22pm
The third component looks at the date portion where there could be one or two digit and a 3 character month. In this case it is 23 Mar
This part looks at anything after the date followed by a hyphen till it finds a :
That would give us the name Person 1
Finally we take anything which comes after the : till the line is finished and that is the chat text – which, in this case, is How are you
At the time of replacing the data, we are using the same four components. But adding a tab between them using the ^t code. Finally we add a paragraph mark to end the record.
1 2^t3^t4^t5^p
Notice that there is a space between 1 and 2. This is because we wanted to separate the time and the time unit. Earlier it was 9:22pm, not it will become 9:22 pm
Why does this matter, this space is important for Excel to understand the data as time. If the space is missing, Excel will not convert it to a valid time value.
Analyzing the data
Now you have two choices. Copy paste to Excel or Save as text file and import it into Power Query.
Using Excel directly
If the data is not large, Copy paste the data directly into Excel. Because it is already tab Delimited, it will automatically go into appropriate columns. The problem is, time is not being recognized as time. It is still text. You will need to parse it and use the time() function to create proper time.
Using Power Query
For large data, save the file as a text file and import it into Excel Power Query . There will be paragraph marks within messages posted by users. Those rows will still be there. All the text will be dumped in the first column. After importing into PowerQuery, split the column using Tab as the delimiter – which will work well because we have ensured Tabs go in the correct place.
Power Query has built in capability to convert data from text to various other data types. Now right click on the first time column and choose Change Type – Time. This will work even if there was no space between the time and the time unit.
Power Query will attempt to convert all values in first column to dates. Where the user typed text has appeared in the first column, the conversion will fail and you will see Error in many cells. No problem. Just choose Remove Errors from the menu. All errors are now removed.
Now check the data again. There may be some nulls. Remove those by filtering them out.
This will get rid of all the unwanted rows.
Unfortunately, the first line of a multi-paragraph post will be retained and other rows will be deleted. This is the limitation of the approach we have seen.
Further reading
This capability of wildcard search is technically called Regular Expressions. Here are two nice articles to help you to learn more about this great feature.
Putting Regular Expressions to work in Word
Finding and Replacing characters using Wildcards by Graham Mayor, and Klaus Linke
What next
As you can see we did a good job. But the multi-line posts continue to be a problem. We have to import them first and then get rid of them. Tomorrow, I will post a PowerShell script written by my friend and Microsoft Platform Expert Raj Chudhuri, which solves this problem as well.
— —