Alternative data such as credit card transactions and web data is becoming an integral part of the investment process. Building an alternative data effort can be expensive, but fortunately, unlike buying datasets and hiring people, the technology infrastructure required to analyze alternative data can be acquired at minimal to zero cost. Here, Norman Niemer, Chief Data Scientist at UBS Asset Management (QED)*, explores the best tools to analyze alternative investment data.
These free tools will allow you to analyze alternative data without having to spend a lot, and can be used individually to perform a particular function or together to form a comprehensive technology stack to operate your entire data effort. The list walks you through the alternative data analysis process step by step. For each step, you get the key tools, including links; the final product gives you the resources needed to analyze alternative data at a cost of $0!
Overview of the alternative data analysis process
- Get data: copy data from the vendor to where you want to analyze it
- Ingest data: you may need to manipulate raw data for it to be loaded correctly
- Load data: import data into the analysis package
- Preprocess data: clean, filter and transform data for it to be ready for modeling
- Modeling: apply your analysis, make predictions and draw conclusions
- Presenting: present your insights and conclusions in a digestible format
| Task | Tools |
| --- | --- |
| **Get Data** | |
| AWS S3 file download | S3 Browser (graphical), AWS CLI (programmatic) |
| FTP file download | WinSCP (graphical), pyftpsync (programmatic) |
| Vendor API data download | Postman (graphical), Requests (programmatic) |
| Corporate proxy authentication | Cntlm |
| Manage data pipelines | Luigi / Airflow / d6tpipe |
| **Ingest & Load** | |
| Read CSV, text and Excel files directly without ingesting | Pandas / Dask / Pyspark |
| Quickly ingest any type of CSV or Excel data | d6tstack |
| Store data in a database | Postgres / MySQL / others (database engines); SQuirreL / HeidiSQL (GUI tools) |
| **Preprocess** | |
| Prepare data to fit into a prediction model | Pandas / Dask / Pyspark |
| Join datasets even if they don't match perfectly | d6tjoin |
| Statistical data visualization | seaborn |
| Handle missing values | fancyimpute |
| **Model** | |
| Basic statistical analysis | Pandas |
| Advanced statistical analysis | Statsmodels |
| Bayesian modeling in Python | Pymc3 |
| Probabilistic modeling in Python | Pomegranate |
| **Present** | |
| Summarize your results in static charts | Pandas visualization |
| Interactive reports and dashboards | RStudio Markdown (reports, includes Python support); RShiny (dashboards) |
| Interactive visualizations, alternative | Bokeh |
| Turn Jupyter notebooks into slideshows | Jupyter present |
Step 1. Getting Data
Why should I care?
So you’ve found a dataset on alternativedata.org, the vendor gave you access and you are ready to leverage alternative data! The very first step to analyzing alternative data is to get your hands on the data.
Help, I’m a total beginner! How do I install Python and how do I learn to use it?
We will be using Python a lot for many tasks. If you’ve never used Python before, we recommend you install Anaconda Python and learn to code using the many excellent free/affordable resources such as Codecademy and Coursera. Note that for getting data you can get by without Python if need be, but you will definitely need it later on when you want to process and analyze data. For additional details see alternativedata.org step 1.
Step 2 & 3. Ingesting and Loading Data
Why should I care?
In step 1 you obtained data from vendors for you to do your analysis. The data may arrive in various formats and you may need to manipulate the raw data for it to be loaded correctly.
What are pandas, dask, pyspark? How do I choose between them?
Those are core Python data analytics libraries. They each have pros and cons, mostly related to the size of the data you are analyzing. An overview of each is given below. Also note that there are graphical and commercial alternatives available for data preparation, data integration and data analysis.
Pandas
- Overview: THE Python data analysis library. If you’re just getting started, stick with this; it will take you a long way
- Pros: widely used, with lots of support and many extensions available
- Cons: likely to break when you start analyzing files of 1GB or more

Dask
- Overview: built on top of pandas to make it scale to large datasets. The next best thing after you’ve pushed pandas to its limits
- Pros: good for analyzing large files and for intermediate multi-core/distributed computing
- Cons: missing some pandas functionality and works slightly differently from pandas

Pyspark
- Overview: once you start analyzing really large (TB+) datasets, pyspark gives you fully distributed computing
- Pros: scales really well and is integrated into the Hadoop ecosystem
- Cons: steeper learning curve, more involved infrastructure setup and higher cost
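Before moving to Dask, chunked reading is one way to push pandas past its memory limits. A minimal sketch, using a small in-memory CSV as a stand-in for a real multi-gigabyte vendor file (the tickers and numbers are made up for illustration):

```python
import io
import pandas as pd

# Stand-in for a large vendor file; in practice this would be a file path.
csv_data = io.StringIO(
    "ticker,date,sales\n"
    "ABC,2018-01-31,100\n"
    "ABC,2018-02-28,110\n"
    "XYZ,2018-01-31,50\n"
    "XYZ,2018-02-28,55\n"
)

# Read in chunks so only one piece is in memory at a time,
# accumulating per-ticker totals as we go.
totals = pd.Series(dtype=float)
for chunk in pd.read_csv(csv_data, chunksize=2, parse_dates=["date"]):
    totals = totals.add(chunk.groupby("ticker")["sales"].sum(), fill_value=0)

print(totals.to_dict())  # {'ABC': 210.0, 'XYZ': 105.0}
```

Dask’s `dask.dataframe` offers a largely pandas-compatible API (e.g. `dd.read_csv`) that handles this partitioning for you once chunked pandas becomes unwieldy.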
For additional details see alternativedata.org step 2-3.
Step 4. Preprocess: clean, filter and transform data for it to be ready for modelling
Why should I care?
Everybody knows - or should know - the maxim of data analysis: garbage in, garbage out! Before you start feeding data into a prediction model, you need to explore the data and likely clean and transform it for the data to be appropriate to feed into a prediction model.
What are common problems and how can I solve them?
- Look-ahead bias: for your model to work in practice, you can’t make predictions with data that was not available at the time of the decision. Say, for example, a company reports on 2018-01-08 and an alt data provider has data for 2017-12-31, but the data gets published with a two-week delay; you couldn’t have used it to make a prediction in your backtest. Use asof joins to avoid this bias.
- Missing values: these are obvious when present because they will likely break your prediction code, so you have to deal with them somehow. The simple solution is to just drop them, but alternative datasets often have short histories, so simply dropping missing values can significantly impact model training and/or skew results. Inspect what you will be dropping and why it is missing; perhaps there is a fix, or try to fill them intelligently.
- Outliers: unlike missing values, they likely won’t break your prediction library, but they will skew your predictions without you even noticing! Alternative datasets often have short histories; combining that with noisy financial data means you should carefully inspect your prediction input data.
- Stationarity: prediction models often don’t produce good results when your data is not “stationary”. Practically, that means you want to model changes instead of levels; for example, it’s better to model sales growth than sales levels.
- Seasonality: consumer companies especially have high seasonality, eg over Christmas. If possible, work with yearly changes instead of sequential changes.
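The stationarity and seasonality points above can be sketched in pandas: model year-over-year changes rather than levels or sequential changes. The quarterly sales numbers below are made up for illustration:

```python
import pandas as pd

# Hypothetical quarterly sales for one company, with strong Q4 seasonality.
idx = pd.period_range("2016Q1", "2018Q4", freq="Q")
sales = pd.Series([100, 80, 90, 150,
                   110, 88, 99, 165,
                   121, 97, 109, 181], index=idx, name="sales")

# Levels are non-stationary and seasonal; sequential changes still mix
# growth with seasonality. Year-over-year changes address both: compare
# each quarter to the same quarter one year earlier.
yoy = sales.pct_change(4)

print(yoy.dropna().round(3).iloc[:4].tolist())  # [0.1, 0.1, 0.1, 0.1]
```

Note how the raw series jumps every Q4 while the year-over-year series shows a steady 10% growth rate — the signal the seasonality was hiding.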
Common preprocessing operations
You’re really getting into the weeds now with the data analysis libraries. To get you started, we suggest working with pandas; here are some resources:
- 10 Minutes to pandas: good overview into what pandas can do
- Indexing and Selecting Data: how to filter out the data you are interested in from raw data
- Group By: split-apply-combine: perform the same computation for different groups, eg calculate an industry mean or generate predictions for all companies in the vendor universe
- Merge, join, and concatenate: combine different datasets
- Reshaping: reshape your data, eg transpose from fiscal quarters going sideways to quarters going down
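As a small taste of split-apply-combine, here is a sketch of computing an industry mean and each company’s deviation from it; the panel below is made up for illustration:

```python
import pandas as pd

# Hypothetical vendor panel: one row per company.
df = pd.DataFrame({
    "industry": ["retail", "retail", "tech", "tech", "tech"],
    "ticker":   ["ABC", "DEF", "GHI", "JKL", "MNO"],
    "sales_growth": [0.05, 0.07, 0.20, 0.10, 0.15],
})

# Split-apply-combine: compute each industry's mean growth and each
# name's deviation from it -- a common way to make signals comparable.
df["industry_mean"] = df.groupby("industry")["sales_growth"].transform("mean")
df["excess_growth"] = df["sales_growth"] - df["industry_mean"]

print(df[["ticker", "excess_growth"]])
```

`transform("mean")` broadcasts the group mean back to each row, so the result lines up with the original frame instead of collapsing to one row per group.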
Almost always you will need to combine different datasets to make a prediction, eg vendor data with financial statement data, or data from multiple vendors. There are three types of joins:
- Exact joins: say you have a dataframe with data from a vendor and want to analyze it by industry, but the vendor didn’t include industry data. You can merge another dataset with industry data on company ticker using pd.merge(), provided the ticker format matches exactly
- Fuzzy joins: say you want to combine data from two vendors but they use different ticker conventions (“ABC US Equity” vs “ABC-US”) or date conventions (business vs calendar days). To find the closest match, you can use d6tjoin top1 with a variety of similarity functions
- Asof joins: say you have a set of company-reported KPIs with announcement dates and want to know the last known data point before each announcement; use pd.merge_asof(direction='backward'). Alternatively, say you want to know the first trading price after some announcement, including weekends etc; use pd.merge_asof(direction='forward')
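A minimal sketch of the backward asof join, which is also what avoids the look-ahead bias discussed earlier; the announcement and vendor data below are made up for illustration:

```python
import pandas as pd

# Hypothetical company announcements and a vendor series published with a lag.
announcements = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-08", "2018-04-09"]),
    "ticker": ["ABC", "ABC"],
    "reported_eps": [1.2, 1.3],
})
vendor = pd.DataFrame({
    "publish_date": pd.to_datetime(["2017-12-20", "2018-01-05", "2018-03-30"]),
    "ticker": ["ABC", "ABC", "ABC"],
    "card_sales": [100, 105, 112],
})

# For each announcement, take the last vendor point *published* on or
# before that date. Both frames must be sorted on the join keys.
merged = pd.merge_asof(
    announcements.sort_values("date"),
    vendor.sort_values("publish_date"),
    left_on="date", right_on="publish_date",
    by="ticker", direction="backward",
)
print(merged[["date", "reported_eps", "card_sales"]])
```

Joining on the publish date rather than the period the data refers to is what keeps the backtest honest: a 2017-12-31 data point published two weeks later never leaks into a 2018-01-08 prediction.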
Many things can go wrong when joining datasets, so we suggest you use d6tjoin prejoin prior to running joins to assess join quality and analyze join problems.
Step 5. Modelling: apply your analysis, make predictions and draw conclusions
Why should I care?
Finally! This is the step where you predict an important investment metric, such as returns or financials using alternative data. Because you’ve diligently ingested and cleaned the data, you’ve maximized your chances of making meaningful and accurate forecasts.
What are some unique challenges modeling alternative datasets and how do I address them?
- Short history: since many of these datasets just weren’t available before, they have a short history. Use simple models on a small set of predictors to avoid overfitting and improve out-of-sample performance.
- Limited breadth: quants might be used to datasets that cover 1000+ names. Just because a dataset has limited breadth doesn’t mean it can’t be really valuable.
- Lots of datasets: with more datasets coming out every day, it is tempting to throw everything in and see what happens. This might fit well in-sample but often leads to bad investment decisions later. Use cross-validation, point-in-time out-of-sample testing and your economic intuition to avoid overfitting.
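Following the advice above, here is a minimal sketch of a simple one-predictor model with a point-in-time train/test split. All data is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Short history: 24 monthly observations, one alt-data predictor.
n = 24
x = rng.normal(size=n)                        # e.g. card-sales growth (synthetic)
y = 0.5 * x + rng.normal(scale=0.3, size=n)   # e.g. revenue growth (synthetic)

# Point-in-time split: fit on the first 18 months only, then evaluate
# on the last 6 months the model has never seen.
X_train = np.column_stack([np.ones(18), x[:18]])
beta, *_ = np.linalg.lstsq(X_train, y[:18], rcond=None)

X_test = np.column_stack([np.ones(6), x[18:]])
pred = X_test @ beta
oos_corr = np.corrcoef(pred, y[18:])[0, 1]
print(f"slope={beta[1]:.2f}, out-of-sample corr={oos_corr:.2f}")
```

The point of the exercise is the split: judging the model only on the held-out tail mimics how it would have performed in real time, which is what guards against the overfitting risks listed above.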
Step 6. Presenting: present your insights and conclusions in a digestible format
Why should I care?
Many senior stakeholders, as well as fundamental investment analysts, prefer simple messages with a key insight backed up by data. They might also want to interact with the data in an intuitive way. Less is more here; follow Einstein’s advice: “Everything should be made as simple as possible, but not simpler”.
Next Steps: Sample Code and Tutorials: Additional resources including sample code and hands-on tutorials are available on github.
*Disclaimer: These materials, and any other information or data conveyed in connection with these materials, is intended for informational purposes only. Under no circumstances are these materials, or any information or data conveyed in connection with such report, to be considered an offer or solicitation of an offer to buy or sell any securities of any company. Nor may these materials, or any information or data conveyed in connection with such report, be relied on in any manner as legal, tax or investment advice. The facts and opinions presented are those of the author only and not official opinions of UBS.