
The best python tools to analyze alternative investment data for $0

Alternative data such as credit card transactions and web data is becoming an integral part of the investment process. Building an alternative data effort can be expensive, but fortunately, unlike buying datasets and hiring people, the technology infrastructure required to analyze alternative data can be acquired at minimal to zero cost. Here, Norman Niemer, Chief Data Scientist at UBS Asset Management (QED)*, explores the best tools to analyse alternative investment data.

These free tools will allow you to analyze alternative data without having to spend a lot, and can be used individually to perform a particular function or together to form a comprehensive technology stack to operate your entire data effort. The list walks you through the alternative data analysis process step by step. For each step, the key tools are listed with links - the end result gives you the resources needed to analyze alternative data at a cost of $0!

Overview of the alternative data analysis process

  1. Get data: copy data from the vendor to where you want to analyse data
  2. Ingest data: you may need to manipulate raw data for it to be loaded correctly
  3. Load data: import data into the analysis package
  4. Preprocess data: clean, filter and transform data for it to be ready for modeling
  5. Modeling: apply your analysis, make predictions and draw conclusions
  6. Presenting: present your insights and conclusions in a digestible format
Step: Get
  • AWS S3 file download: S3 Browser (graphical), AWS CLI (programmatic)
  • FTP file download: WinSCP (graphical), pyftpsync (programmatic)
  • Vendor API data download: Postman (graphical), Requests (programmatic)
  • Corporate proxy authentication: Cntlm
  • Manage data pipelines: Luigi / Airflow / d6tpipe

Step: Ingest & Load
  • Read CSV, text and Excel files directly without ingesting: Pandas / Dask / Pyspark
  • Quickly ingest any type of CSV or Excel data: d6tstack
  • Store data in a database: Postgres / MySQL / others (database engines); SQuirreL / HeidiSQL (GUI tools)

Step: Preprocess
  • Prepare data to fit into a prediction model: Pandas / Dask / Pyspark
  • Join datasets even if they don't match perfectly: d6tjoin
  • Statistical data visualisation: Seaborn
  • Handle missing values: fancyimpute

Step: Modelling
  • Basic statistical analysis: Pandas
  • Advanced statistical analysis: Statsmodels
  • Machine learning: Sklearn
  • Bayesian modelling in Python: PyMC3
  • Probabilistic modelling in Python: Pomegranate

Step: Presentation
  • Summarize your results in static charts: Pandas visualization
  • Interactive reports and dashboards: RStudio Markdown (reports, includes Python support), RShiny (dashboards)
  • Interactive visualizations: Plot.ly
  • Interactive visualizations, alternative: Bokeh
  • Turn Jupyter notebooks into slideshows: Jupyter present

 

Step 1. Getting Data

Why should I care?

So you’ve found a dataset on alternativedata.org, the vendor gave you access and you are ready to leverage alternative data! The very first step to analyzing alternative data is to get your hands on the data.

Help, I’m a total beginner! How do I install python and how do I learn how to use it?

We will be using python a lot for many tasks. If you’ve never used python before, we recommend you install Anaconda python and learn to code in python using many of the excellent free/affordable resources such as codecademy and Coursera. Note that for getting data you can get by without python if need be but you will definitely need it later on when you want to process and analyze data. For additional details see alternativedata.org step 1.
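If your vendor offers a REST API (the Requests route from the tool list above), a download script can be a few lines. The sketch below is illustrative only: the endpoint URL, API key and local Cntlm proxy port are placeholders, not real vendor details.

```python
import requests

# Hypothetical vendor endpoint and credentials; replace with your own
BASE_URL = "https://api.example-vendor.com/v1/data"
API_KEY = "your-api-key"

session = requests.Session()
session.headers.update({"Authorization": f"Bearer {API_KEY}"})
# If you sit behind a corporate proxy, point requests at your local Cntlm port
session.proxies.update({"https": "http://127.0.0.1:3128"})

def download(params, dest):
    """Fetch one file from the vendor API and save it locally."""
    resp = session.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()
    with open(dest, "wb") as f:
        f.write(resp.content)
```

A graphical tool like Postman is handy for exploring the API first; once you know the parameters, a script like this makes the download repeatable.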

Step 2 & 3. Ingesting and Loading Data

Why should I care?

In step 1 you obtained data from vendors for you to do your analysis. The data may arrive in various formats and you may need to manipulate the raw data for it to be loaded correctly.

What are pandas, dask, pyspark? How do I choose between them?

Those are core python data analytics libraries. They all have some pros/cons, mostly related to the size of data you are analyzing. An overview of each is given below. Also note that there are graphical and commercial alternatives available for data preparation, data integration and data analysis.

Pandas

  • Overview: THE python data analysis library. If you’re just getting started, stick with this; it will take you a long way
  • Pros: Widely used with lots of support and many extensions available
  • Cons: Likely to break when you start analyzing >=1GB files

Dask

  • Overview: built on top of pandas to make it scale for large datasets. The next best thing after you’ve pushed pandas to its limits
  • Pros: Good for analyzing large files and intermediate multi-core/distributed computing
  • Cons: Missing some pandas functionality and works slightly differently from pandas

Pyspark

  • Overview: once you start analyzing really large (TB+) datasets, pyspark gives you fully distributed computing
  • Pros: Scales really well and integrated into Hadoop ecosystem
  • Cons: Steeper learning curve, more infrastructure setup and more expensive
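As a quick illustration of the load step, here is a minimal pandas sketch that reads a CSV directly; the column names are made up, so substitute your vendor's actual schema. Dask and pyspark expose very similar read calls, which is part of why the pandas-first path scales well.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a vendor file (hypothetical schema)
csv_data = io.StringIO(
    "date,ticker,spend\n"
    "2018-01-05,ABC,120.5\n"
    "2018-01-05,XYZ,98.0\n"
)
# parse_dates ensures the date column loads as datetime, not string
df = pd.read_csv(csv_data, parse_dates=["date"])
```

Checking dtypes right after loading (`df.dtypes`) catches many ingestion problems early, e.g. numbers that arrived as strings because of stray characters.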

For additional details see alternativedata.org step 2-3.

Step 4. Preprocess: clean, filter and transform data for it to be ready for modelling

Why should I care?

Everybody knows - or should know - the maxim of data analysis: garbage in, garbage out! Before you start feeding data into a prediction model, you need to explore the data and likely clean and transform it for the data to be appropriate to feed into a prediction model.

What are common problems and how can I solve them?

  • Look-ahead bias: for your model to work in practice, you can't be making predictions with data that was not available at the time of decision. Say, for example, a company reports on 2018-01-08 and an alt data provider has data for 2017-12-31, but the data gets published with a 2-week delay; you couldn't have used it for making a prediction in your backtest. Use asof joins to avoid this bias
  • Missing values: they are obvious when present because they likely break your prediction code, so you have to deal with them somehow. The simple solution is to just drop them, but alternative datasets often have short histories, so simply dropping missing values can significantly impact model training and/or skew results. Inspect what you will be dropping and why it is missing; perhaps there is a fix, or try to fill them intelligently
  • Outliers: unlike missing values, they likely won’t break your prediction library, but they will skew your predictions without you even noticing! Alternative datasets often have short histories; combining that with noisy financial data means you should carefully inspect your prediction input data
  • Stationarity: prediction models often don’t produce good results when your data is not “stationary”. Practically, that means you want to model changes instead of levels; for example, it’s better to model sales growth than it is to model sales
  • Seasonality: consumer companies especially have high seasonality, e.g. over Christmas. If possible, work with yearly changes instead of sequential changes
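The stationarity and seasonality points can be sketched in a few lines of pandas: model growth rates rather than levels, and prefer year-over-year to sequential changes when the series is seasonal. The quarterly sales numbers below are invented, with a deliberate Q4 spike.

```python
import pandas as pd

# Invented quarterly sales with a seasonal Q4 spike
sales = pd.Series(
    [100, 102, 101, 130, 105, 104, 103, 135],
    index=pd.period_range("2016Q1", periods=8, freq="Q"),
)
qoq_growth = sales.pct_change(1)  # sequential change: dominated by the Q4 spike
yoy_growth = sales.pct_change(4)  # year-over-year change: seasonality nets out
```

The sequential series swings wildly around every Q4 while the year-over-year series is a steady few percent, which is the signal you actually want to model.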

Common preprocessing operations

You’re really getting into the weeds now with the data analysis libraries. To get you started, we suggest working with pandas; here are some resources:

See documentation and the pandas book for full details.
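A minimal sketch of the clean/filter/transform loop in pandas follows; the frame is fabricated, with one missing value and one extreme outlier, to show the two problems flagged above. For smarter imputation than a median fill, see fancyimpute from the tool list.

```python
import numpy as np
import pandas as pd

# Fabricated alt-data frame: one missing value and one extreme outlier
df = pd.DataFrame({
    "ticker": ["ABC", "ABC", "XYZ", "XYZ"],
    "spend": [120.5, np.nan, 98.0, 5000.0],
})
df["spend"] = df["spend"].fillna(df["spend"].median())        # naive imputation
df["spend_capped"] = df["spend"].clip(upper=df["spend"].quantile(0.95))  # winsorize
abc_only = df[df["ticker"] == "ABC"]                           # filter to one name
```

Inspect before you impute or clip: a quick `df.describe()` and a seaborn distribution plot will tell you whether a fill or a cap is even appropriate.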

Combining datasets

Almost always you need to combine different datasets to make a prediction, e.g. vendor data with financial statement data, or data from multiple vendors. There are three types of joins:

  • Exact joins: say you have a dataframe with data from a vendor and want to analyze it by industry, but the vendor didn't include industry data. You can merge another dataset with industry data on company ticker using pd.merge() if the ticker format matches exactly
  • Fuzzy joins: say you want to combine data from two vendors but they use different ticker (“ABC US Equity” vs “ABC-US”) or date conventions (business vs calendar days). To find the closest match, you can use d6tjoin top1 with a variety of similarity functions
  • Asof joins: say you have a set of company-reported KPIs with announcement dates and want to know the last known data point before each announcement; use pd.merge_asof(..., direction='backward'). Alternatively, say you want to know the first trading price after some announcement, including weekends etc.; use pd.merge_asof(..., direction='forward')

Many things can go wrong when joining datasets; we suggest you use d6tjoin prejoin prior to running joins to assess join quality and analyze join problems.
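The exact and asof joins can be sketched with pandas as below; the tickers, dates and numbers are made up for illustration. Fuzzy joins need similarity matching and are best handled with d6tjoin rather than hand-rolled string logic.

```python
import pandas as pd

# Exact join: attach industry labels on a matching ticker key
alt = pd.DataFrame({"ticker": ["ABC", "XYZ"], "spend": [120.5, 98.0]})
sectors = pd.DataFrame({"ticker": ["ABC", "XYZ"], "industry": ["Retail", "Tech"]})
exact = pd.merge(alt, sectors, on="ticker")

# Asof join: last alt-data point known on or before each announcement date
kpis = pd.DataFrame({"date": pd.to_datetime(["2018-01-08"]), "eps": [1.2]})
panel = pd.DataFrame({
    "date": pd.to_datetime(["2017-12-31", "2018-01-15"]),
    "spend": [500.0, 520.0],
})
asof = pd.merge_asof(kpis, panel, on="date", direction="backward")
```

Note that pd.merge_asof requires both frames to be sorted on the join key, and that direction='backward' is exactly what protects you from the look-ahead bias discussed in step 4.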

Step 5. Modelling: apply your analysis, make predictions and draw conclusions

Why should I care?

Finally! This is the step where you predict an important investment metric, such as returns or financials using alternative data. Because you’ve diligently ingested and cleaned the data, you’ve maximized your chances of making meaningful and accurate forecasts.

What are some unique challenges modeling alternative datasets and how do I address them?

  • Short history: since many of these datasets just weren’t available before, they have a short history. Use simple models on a small set of predictors to avoid overfitting and improve out-of-sample performance.
  • Limited breadth: quants might be used to datasets that cover 1000+ names. Just because a dataset has limited breadth doesn’t mean it can’t be really valuable
  • Lots of datasets: with more datasets coming out every day, it is tempting to throw everything in and see what happens. This might work well when modeling but often leads to bad decisions in the future. Use cross-validation, point-in-time out-of-sample testing and your economic intuition to avoid overfitting and making bad investment decisions in the future
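To make the overfitting advice concrete, here is a sketch using a deliberately simple model with time-ordered cross-validation; the data is synthetic. Scikit-learn's TimeSeriesSplit keeps every validation fold strictly after its training fold, which mimics point-in-time out-of-sample testing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic panel: two alt-data predictors with a known linear signal
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([0.5, -0.3]) + rng.normal(scale=0.1, size=40)

# Time-ordered folds: each validation set comes after its training set
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
```

A plain shuffled K-fold would leak future information into the training folds; on a real alt-data panel that inflates your backtest and sets you up for the bad future decisions described above.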

Step 6. Presenting: present your insights and conclusions in a digestible format

Why should I care?

Many senior stakeholders, as well as fundamental investment analysts, prefer simple messages with a key insight backed up by data. They might also want to interact with the data in an intuitive way. Less is more here; follow Einstein’s advice: “Everything should be made as simple as possible, but not simpler”.
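For the static-chart route (pandas visualization in the tool list above), one focused chart with one message often beats a dashboard; the growth numbers below are invented.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import pandas as pd

# Invented year-over-year growth series: one chart, one message
yoy = pd.Series(
    [0.02, 0.04, 0.05, 0.08],
    index=["Q1", "Q2", "Q3", "Q4"],
    name="YoY spend growth",
)
ax = yoy.plot(kind="bar", title="Spend growth accelerated through the year")
ax.figure.savefig("spend_growth.png")
```

If stakeholders then want to explore the data themselves, graduate to Plot.ly, Bokeh or an RShiny dashboard from the tool list.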

Next Steps: Sample Code and Tutorials: Additional resources including sample code and hands-on tutorials are available on github.

*Disclaimer: These materials, and any other information or data conveyed in connection with these materials, is intended for informational purposes only. Under no circumstances are these materials, or any information or data conveyed in connection with such report, to be considered an offer or solicitation of an offer to buy or sell any securities of any company. Nor may these materials, or any information or data conveyed in connection with such report, be relied on in any manner as legal, tax or investment advice. The facts and opinions presented are those of the author only and not official opinions of UBS.
