Getting started with Python for Accounting Research

Posted by Ties de Kok - Mar 24, 2019

The Python programming language is a very powerful tool to have in your toolkit as an Accounting researcher. Python is the data science equivalent of a Swiss army knife, as it can be used to solve a wide variety of problems: data gathering, web scraping, data processing/cleaning, natural language processing, data analysis (e.g. machine learning), and data visualization. While other tools like R, Stata, and SAS might outshine Python in specific applications (e.g. statistical analysis), it is hard to beat Python as a general-purpose tool that can tackle practically any data science problem.

The general-purpose nature of Python can, however, make it overwhelming to get started when your specific goal is to use it for empirical research. To make this start easier, I have created a GitHub repository with information and materials on how to get started with Python for your own research projects. These materials are created and selected based on my own experiences and preferences as a Python enthusiast and aspiring Accounting researcher. The repository is available here:

Brief overview of the repository.

The first part of the repository consists of the readme, which gives information on the practical aspects of getting started with Python. More specifically, it contains the following sections:

1. How to get your Python setup ready (Link)
2. How to use Python and specifically the Jupyter Notebook (Link)
3. How to install Python packages (Link)
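
Before installing anything, it can help to check which packages are already available in your environment. The snippet below is a minimal sketch of such a check using only the standard library; the helper name `is_installed` and the package list are illustrative assumptions, not something taken from the repository.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Hypothetical wish list; adjust it to the packages your project needs.
wanted = ["pandas", "seaborn", "bokeh"]
missing = [name for name in wanted if not is_installed(name)]
print("Missing packages:", missing or "none")
```

Anything reported as missing can then be installed with your package manager of choice, as described in the readme.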

The second part of the repository is a collection of Jupyter Notebooks introducing the Python syntax and demonstrating how to use Python for various data science tasks:

1. An introduction to the Python syntax (Link)
2. Examples of how to open a variety of file types (txt, csv, Excel, Stata, SAS, JSON, HDF) (Link)
3. A comprehensive overview of how to use the Pandas library for data cleaning / wrangling (Link)
4. Examples of how to generate visualizations with Pandas, Seaborn, and Bokeh (Link)
5. A comprehensive overview of how to use Python for APIs and web scraping (Link)
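
To give a flavor of what opening and handling data looks like in Python, here is a small sketch that reads CSV and JSON content and computes a simple total per group. It is independent of the notebooks' own code: it uses only the standard library so that it runs without extra installs, and the sample data (tickers, years, revenue figures) is made up for illustration.

```python
import csv
import io
import json

# Made-up sample data standing in for files on disk.
csv_text = "ticker,year,revenue\nAAA,2017,100.5\nAAA,2018,120.0\nBBB,2018,80.25\n"
json_text = '{"ticker": "AAA", "exchange": "EXAMPLE"}'

# csv.DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Values arrive as strings, so convert the numeric columns explicitly.
for row in rows:
    row["year"] = int(row["year"])
    row["revenue"] = float(row["revenue"])

# json.loads turns a JSON document into plain Python dicts and lists.
meta = json.loads(json_text)

# Sum revenue per ticker.
totals = {}
for row in rows:
    totals[row["ticker"]] = totals.get(row["ticker"], 0.0) + row["revenue"]

print(totals)
print(meta["exchange"])
```

With real files you would pass an opened file object instead of `io.StringIO`; a library such as Pandas can perform the type conversion and grouping in fewer steps.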

I have also created an extensive Jupyter Notebook on using Python for Natural Language Processing.

This notebook is available in a separate GitHub repository:

Where to start?

If you are starting from scratch I recommend the following:

1. Familiarize yourself with the Getting your Python setup ready and Using Python sections (Link)
2. Check the Code along! section to make sure that you can interactively use the Jupyter Notebooks
3. Work through the 0_python_basics.ipynb notebook and try to get a basic grasp of the Python syntax
4. Do the "Basic Python tasks" part of the exercises.ipynb notebook
5. Work through the 1_opening_files.ipynb, 2_handling_data.ipynb, and 3_visualizing_data.ipynb notebooks.

Note: the 2_handling_data.ipynb notebook is very comprehensive, so feel free to skip the more advanced parts at first.

6. Do the "Data handling tasks (+ some plotting)" part of the exercises.ipynb notebook

If you are interested in web scraping:

7. Work through the 4_web_scraping.ipynb notebook
8. Do the "Web scraping" part of the exercises.ipynb notebook
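
As a taste of what web scraping involves, the sketch below extracts all link targets from a piece of HTML. It is an illustration only and not the approach used in the notebook: it relies on the standard library's `html.parser`, the class name `LinkCollector` is my own, and the HTML snippet is made up rather than downloaded from a real site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up page source; in practice this would come from an HTTP request.
page_source = """
<html><body>
<a href="/filings/2018.html">2018 filing</a>
<a href="/filings/2019.html">2019 filing</a>
<p>No link here.</p>
</body></html>
"""

collector = LinkCollector()
collector.feed(page_source)
print(collector.links)
```

Dedicated scraping libraries offer more convenient ways to select elements, but the underlying idea is the same: parse the page and pull out the pieces you need.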

If you are interested in Natural Language Processing:

9. Look at the NLP_notebook.ipynb in my Python NLP Tutorial repository (Link)

Tip: try out the notebooks using Binder

It is possible to try out the notebooks (and actually run the examples) without installing anything by clicking the Binder button below: