M-Pesa Statement Analysis
introduction
This project showcases Arbutus Analyzer’s data cleaning and correction capabilities. The work involves cleaning the PDF statement, importing the appropriate columns from the statement into an Arbutus flat file, checking the conversion for accuracy, and publishing the analysis results on Power BI.
the challenge
M-Pesa monthly statements are received as PDF documents via email. Each is a ledger-like document detailing transactions (both received and paid), the amounts, and the resulting balances. In that form it is difficult to glean quick insights, such as how much is spent on each type of expense (e.g. power, groceries, entertainment) and how much is received from each source.
The project also seeks to show that consumer patterns and identifying details can be obtained by manipulating the data in the statement and applying a few powerful algorithms.
methodology
Download the PDF statement.
Create a project folder in Arbutus Analyzer.
Import the PDF statement as a data source.
Using Arbutus’ PDF import tool, train a model in the software to read and interpret the columns and data types in the statement. This involves highlighting the data columns and using suitable expressions to create sorted character columns and numeric columns. Once this is done, any statement can be read correctly and formatted automatically into a flat-file table that supports various analytical procedures.
Verify the data. Run simple statistical analysis on the numeric columns and confirm that the totals, means, standard deviations, etc. match the expected values. If the model was trained correctly, this step should pass without issues.
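The verification step can also be cross-checked outside Arbutus. The sketch below is a minimal illustration in Python, not the project's actual procedure: it assumes the converted rows are available oldest first, and the column names (details, paid_in, withdrawn, balance), the opening balance, and the sample values are all hypothetical. It checks that each balance follows from the previous one and computes the column statistics mentioned above.

```python
from decimal import Decimal
from statistics import mean, pstdev

# Hypothetical converted rows, oldest first. The column names and values are
# assumptions for this sketch, not the fields Arbutus actually produces.
OPENING_BALANCE = Decimal("200.00")
ROWS = [
    {"details": "Funds received", "paid_in": "1500.00", "withdrawn": "0.00", "balance": "1700.00"},
    {"details": "Pay Bill", "paid_in": "0.00", "withdrawn": "450.00", "balance": "1250.00"},
    {"details": "Transaction charge", "paid_in": "0.00", "withdrawn": "23.00", "balance": "1227.00"},
]


def balance_mismatches(rows, opening_balance):
    """Return indexes of rows whose balance does not follow from the previous balance."""
    mismatches = []
    previous = opening_balance
    for index, row in enumerate(rows):
        expected = previous + Decimal(row["paid_in"]) - Decimal(row["withdrawn"])
        actual = Decimal(row["balance"])
        if expected != actual:
            mismatches.append(index)
        # Continue from the printed balance so one misread amount flags one row.
        previous = actual
    return mismatches


def column_summary(rows, column):
    """Total (exact, as Decimal) plus mean and population standard deviation."""
    values = [Decimal(row[column]) for row in rows]
    as_floats = [float(value) for value in values]
    return {
        "total": sum(values, Decimal("0")),
        "mean": mean(as_floats),
        "stdev": pstdev(as_floats),
    }


if __name__ == "__main__":
    print("Rows with balance mismatches:", balance_mismatches(ROWS, OPENING_BALANCE))
    print("Withdrawn summary:", column_summary(ROWS, "withdrawn"))
```

Totals use Decimal so that currency amounts compare exactly. The mean and standard deviation are computed on floats, which is adequate for a sanity check against the figures reported by the analysis tool.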
results
A simple, easily understood visual statement is produced. At a glance, one can identify income sources, expenditures, charges, etc. By overlaying GIS data on the descriptive columns, one can even do simple pattern analysis, such as identifying favorite travel routes, possible home and work coordinates, the nature and size of the family, estimated income, and other identifying markers.
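The kind of at-a-glance expenditure breakdown described here could be approximated with simple keyword rules on the description column. The following Python sketch is illustrative only: the keyword lists, the column names (details, withdrawn), and the sample rows are hypothetical, and the project itself builds its visuals in Power BI rather than with this code.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical keyword rules; real statements would need a tuned list.
CATEGORY_KEYWORDS = {
    "power": ("kplc", "power"),
    "groceries": ("supermarket",),
    "entertainment": ("cinema",),
    "charges": ("charge",),
}

# Hypothetical sample rows with assumed column names.
SAMPLE_ROWS = [
    {"details": "Pay Bill to KPLC PREPAID", "withdrawn": "1000.00"},
    {"details": "Merchant Payment to XYZ SUPERMARKET", "withdrawn": "2350.50"},
    {"details": "Pay Bill Charge", "withdrawn": "23.00"},
    {"details": "Customer Transfer to J DOE", "withdrawn": "500.00"},
]


def categorise(details):
    """Return the first category whose keyword appears in the description."""
    text = details.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


def totals_by_category(rows):
    """Sum the withdrawn amounts per category."""
    totals = defaultdict(Decimal)  # Decimal() starts each category at 0
    for row in rows:
        totals[categorise(row["details"])] += Decimal(row["withdrawn"])
    return dict(totals)


if __name__ == "__main__":
    for category, total in sorted(totals_by_category(SAMPLE_ROWS).items()):
        print(f"{category}: {total}")
```

Rules are checked in dictionary order and the first match wins, so more specific categories should be listed before general ones such as charges.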