This is a post for developers or those who know PowerShell Script. To understand the context, please read these two posts… This article and the code is contributed by Raj Chaudhuri – my friend and Microsoft platform expert.
WhatsApp Chat Analysis using Microsoft Word
In this article, I will explain a scripting based approach to parsing WhatsApp transcript.
Typical format is like this…
9:22pm, 23 Mar – Person1: How are you
9:23pm, 23 Mar – Person2: I am learning Word search. Do you know it?
9:23pm, 23 Mar – Person1: No. I don’t. You teach me.
I am willing to learn.
I will spend time and effort on it… if you teach me…
9:24pm, 23 Mar – Person1: Sure… I will teach you
Contents
The problem
As described earlier, if you parse this using Excel or Word, the data would be like this.
We will need to delete rows 6 and 7. But that way, the chat content is lost.
Otherwise, we will have to merge the contents of A6 and A7 with the cell D5… which is also cumbersome – considering that there will be hundreds of such records in a large data set.
In short, it is a compromise.
If Power Query provides a robust regular expressions implementation, this problem can be solved. However, at the time of writing this article, Power Query (the M language) does not support regex.
The PowerShell Script
PowerShell is the simplest way available of writing powerful scripts, with full regular expression capability.
$c = Get-Content .raj1.txt -Raw
$rx=”(d{1,2}:d{2}w{2}),s(d{1,2}sw{1,3})s-s(w*):s”
$cstr = $c -split $rx
$resarr = @()
for($i=1; $i -lt $res.Length; $i+=4) {
$n=new-object psobject -property @{Time=$res[$i]; Date=$res[$i+1]; Sender=$res[$i+2]; Message=$res[$i+3].Trim() }
$resarr += $n
}
$resarr | export-csv
I am not going to explain the code here. Those who understand PowerShell will know what his happening here easily.
Instructions
Copy the file and save it as a PS1 file. Change the path of the content file. Run the script.
You will see that the multi-row chat entry is properly parsed into a single record.
If you have any queries, do post them as comments.
Raj Chaudhuri
Raj is a close friend of mine. He works as an independent consultant, based in Bombay. His areas of expertise include BI, .NET programming and SharePoint platform design and development.
2 Responses
Okay, I’ll explain it then.
The file raj1.txt contains an exported WhatsApp chat, in the format shown in the post. The first line of the script:
$c = Get-Content .raj1.txt -Raw
reads the entire file, including line breaks, into a single string variable called $c. The real magic happens two lines down:
$cstr = $c -split $rx
The PowerShell -split operator splits a single string into an array, based on a delimiter. The magic comes from the fact that the delimiter can be a pattern, or a regular expression. In our case, the delimiter is contained in the variable $rx, defined one line above:
$rx=”(d{1,2}:d{2}w{2}),s(d{1,2}sw{1,3})s-s(w*):s”
This is what it means: in the string being split, look for a pattern of 1 to 2 digits (d{1,2}) followed by a colon (:) followed by two digits (d{2}) followed by two letters (w{2}) followed by a comma and a single space (,s) followed by a pattern of 1 to 2 digits, a single space and 1 to 3 letters (d{1,2}sw{1,3}) followed by a space, a dash and a space (s-s), followed by a pattern of any number of letters (w*) followed by a colon and a space (:s).
So let’s look at the following lines, and split them using the regular expression above:
9:22pm, 23 Mar – Person1: How are you
9:23pm, 23 Mar – Person2: I am learning Word search. Do you know it?
9:23pm, 23 Mar – Person1: No. I don’t. You teach me
These will split as follows:
9:22pm – because it matches (d{1,2}:d{2}w{2})
23 Mar – because it matches (d{1,2}sw{1,3}) and follows the previous pattern and ,s
Person1 – because it matches (w*) and follows the previous pattern and is followed by :
How are you – because it is what is left over until the next (d{1,2}:d{2}w{2})
9:23pm – because it matches (d{1,2}:d{2}w{2})
and so on.
The rest of the script reads the array four at a time, and produces a CSV file that looks this this:
“9:22pm”, “23 Mar”, “Person1”, “How are you”
“9:23pm”, “23 Mar”, “Person2”, “I am learning Word Search”
etc. This CSV file can then be imported into Power Query to do the analysis.
Actually, now that I think about this, a CSV file may not be a good idea, because the last field may have embedded line breaks. Doctor, can your Power Query import XML files?
Thanks for the detailed explanation Raj.
Yes. Power Query supports XML.