fbpx

WhatsApp chat analytics: PowerShell Script

This is a post for developers or those who know PowerShell Script. To understand the context, please read these two posts… This article and the code is contributed by Raj Chaudhuri – my friend and Microsoft platform expert.

WhatsApp Chat Analysis using Microsoft Word

Social analytics using Powershell script

In this article, I will explain a scripting based approach to parsing WhatsApp transcript.

Typical format is like this…

9:22pm, 23 Mar – Person1: How are you

9:23pm, 23 Mar – Person2: I am learning Word search. Do you know it?

9:23pm, 23 Mar – Person1: No. I don’t. You teach me.

I am willing to learn.

I will spend time and effort on it… if you teach me…

9:24pm, 23 Mar – Person1: Sure… I will teach you Smile

The problem

As described earlier, if you parse this using Excel or Word, the data would be like this.

Chat in excel

We will need to delete rows 6 and 7. But that way, the chat content is lost.

Otherwise, we will have to merge the contents of A6 and A7 with the cell D5… which is also cumbersome – considering that there will be hundreds of such records in a large data set.

In short, it is a compromise.

If Power Query provides a robust regular expressions implementation, this problem can be solved. However, at the time of writing this article, Power Query (the M language) does not support regex.

The PowerShell Script

PowerShell is the simplest way available of writing powerful scripts, with full regular expression capability.

$c = Get-Content .raj1.txt -Raw
$rx=”(d{1,2}:d{2}w{2}),s(d{1,2}sw{1,3})s-s(w*):s”
$cstr =  $c -split $rx
$resarr = @()
for($i=1; $i -lt $res.Length; $i+=4) {
$n=new-object psobject -property @{Time=$res[$i]; Date=$res[$i+1]; Sender=$res[$i+2]; Message=$res[$i+3].Trim() }
$resarr += $n
}
$resarr | export-csv

I am not going to explain the code here. Those who understand PowerShell will know what his happening here easily.

 Instructions

Copy the file and save it as a PS1 file. Change the path of the content file. Run the script.

You will see that the multi-row chat entry is properly parsed into a single record.

If you have any queries, do post them as comments.

Raj Chaudhuri

Raj Young

Raj is a close friend of mine. He works as an independent consultant, based in Bombay. His areas of expertise include BI, .NET programming and SharePoint platform design and development.

Star
Star
Star

2 Responses

  1. Okay, I’ll explain it then.

    The file raj1.txt contains an exported WhatsApp chat, in the format shown in the post. The first line of the script:
    $c = Get-Content .raj1.txt -Raw
    reads the entire file, including line breaks, into a single string variable called $c. The real magic happens two lines down:
    $cstr = $c -split $rx
    The PowerShell -split operator splits a single string into an array, based on a delimiter. The magic comes from the fact that the delimiter can be a pattern, or a regular expression. In our case, the delimiter is contained in the variable $rx, defined one line above:
    $rx=”(d{1,2}:d{2}w{2}),s(d{1,2}sw{1,3})s-s(w*):s”
    This is what it means: in the string being split, look for a pattern of 1 to 2 digits (d{1,2}) followed by a colon (:) followed by two digits (d{2}) followed by two letters (w{2}) followed by a comma and a single space (,s) followed by a pattern of 1 to 2 digits, a single space and 1 to 3 letters (d{1,2}sw{1,3}) followed by a space, a dash and a space (s-s), followed by a pattern of any number of letters (w*) followed by a colon and a space (:s).

    So let’s look at the following lines, and split them using the regular expression above:
    9:22pm, 23 Mar – Person1: How are you

    9:23pm, 23 Mar – Person2: I am learning Word search. Do you know it?

    9:23pm, 23 Mar – Person1: No. I don’t. You teach me

    These will split as follows:
    9:22pm – because it matches (d{1,2}:d{2}w{2})
    23 Mar – because it matches (d{1,2}sw{1,3}) and follows the previous pattern and ,s
    Person1 – because it matches (w*) and follows the previous pattern and is followed by :
    How are you – because it is what is left over until the next (d{1,2}:d{2}w{2})
    9:23pm – because it matches (d{1,2}:d{2}w{2})
    and so on.

    The rest of the script reads the array four at a time, and produces a CSV file that looks this this:
    “9:22pm”, “23 Mar”, “Person1”, “How are you”
    “9:23pm”, “23 Mar”, “Person2”, “I am learning Word Search”
    etc. This CSV file can then be imported into Power Query to do the analysis.

    Actually, now that I think about this, a CSV file may not be a good idea, because the last field may have embedded line breaks. Doctor, can your Power Query import XML files?

Queries | Comments | Suggestions | Wish list