Downloading data from the WRDS website is convenient but not the most transparent and replicable as it requires a workflow along the lines of:
The problem is that step 2 and step 3 happen outside of your code. This essentially leaves a gap in the steps required to run your code, which can leave other people (or your future-self) confused as to what you exactly did on the WRDS website to extract the required data.
For a long time WRDS only offered to programmatically interface with WRDS data through SAS files. Non-SAS clients (think Python and R) could therefore only interact with WRDS data through SAS-bindings. This approach technically worked but was definitely a substantial hassle for many applications. Fortunately, WRDS decided to also make all their data available through a series of PostgreSQL databases. This change happened somewhere in June / July of 2017 (at least for the WRDS Python package). In other words, we can now finally directly interface with WRDS data using tools like Python, R, Stata, and MATLAB!
The WRDS support documents extensively describe how interface with WRDS using the different programs. I will include all links below, but for the sake of completeness will discuss a couple of tips and tricks + some Python code examples to illustrate how one can use it.
Note, the WRDS
library can be installed from the command line using pip install wrds
https://pypi.python.org/pypi/wrds
import wrds
db = wrds.Connection()
libraries = db.list_libraries()
libraries[:4]
tables = db.list_tables('wrdssec')
tables[:4]
col_headers = db.describe_table(library='wrdssec', table='wrds_nlpsa')
col_headers.head()
There are two ways:
get_table()
raw_sql()
For a table with not too many rows (like some of the linktables) you can use get_table()
, otherwise raw_sql()
is advisable.
get_table()
¶data_5rows = db.get_table(library='wrdssec', table='wrds_nlpsa', columns = ['gvkey', 'cik', 'filename', 'finterms_litigious_count'], obs=5)
data_5rows
raw_sql()
¶data_5rows.cik.values
sql_query = """
SELECT gvkey,
cik,
filename,
finterms_litigious_count
FROM wrdssec.wrds_nlpsa
WHERE cik IN ('0000000003', '0000000003', '0000000003', '0000000003', '0000000003')
"""
data_query = db.raw_sql(sql_query)
data_query