TFG/Bachelor's Thesis: Analysis and performance of Machine Learning for Startup Valuation

As I've said many times, I've been a nerd all my life, so I decided to cheat a bit on my Bachelor's degree and do two things:
1) Take some IT courses in the USA
2) Write my Bachelor's Thesis about Machine Learning, relying more on code than on finance knowledge

A short introduction on why:
Two of the most important characteristics of Startups are uncertainty (Neumann, 2019) and the absence of quantitative information, the main topics of this work. Both factors damage the financing possibilities of the company, especially when it fails to raise sufficient funds from Business Angels or investment rounds (in the US only 0.96% obtain funds from Venture Capital or Business Angels (Entis, 2013)). As explained before, small and medium investors rarely have access to a startup's financial statements, key technologies, audit reports, etc., the opposite of what happens with a company that has a solid track record and, especially, one that is listed on stock markets. Therefore, efforts should be made towards modern data analysis techniques, such as Machine Learning or the several types of Neural Networks, mainly when we deal with qualitative variables.
The original idea was to use Neural Networks to find the relationship between several variables that may influence startup success, and also to provide some time-based predictions. None of this happened, due to lack of time.
Instead, what I got was a binary prediction of Startup success (yes or no) using Machine Learning (Logistic Regression) and the degree of importance of the most interesting variables using Random Trees. The other important side of the project was to shed light on the topic of "AI for startup valuation", so I wrote a condensed but clear summary of the most significant papers on the topic, from the 1980s to 2020.
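In case it helps, here is a minimal sketch of what those two models could look like in scikit-learn. The file name "startups_clean.csv", the "closed" label column and the use of RandomForestClassifier for the variable importances are placeholders and assumptions of this post, not the exact setup of the thesis code:

    # Minimal sketch (not the exact thesis code): Logistic Regression for the
    # success/closed prediction and a forest of randomized trees for variable importance.
    # "startups_clean.csv" and the "closed" column are placeholder names.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("startups_clean.csv")
    X = df.drop(columns=["closed"])              # features, assumed already numeric/encoded
    y = df["closed"]                             # 1 = closed, 0 = operating (assumption)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Binary prediction of startup outcome
    log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Test accuracy:", log_reg.score(X_test, y_test))

    # Degree of importance of each variable
    forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
    ranking = sorted(zip(X.columns, forest.feature_importances_),
                     key=lambda t: t[1], reverse=True)
    for name, importance in ranking[:10]:
        print(f"{name}: {importance:.3f}")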

The most difficult part was finding a large and diverse dataset. I finally got one from Crunchbase (the 2013 Snapshot), which you have to request access to, but it's free to use for non-commercial purposes.
This database covers more than 500,000 startups, but after cleaning I ended up with only 20,448, and with a nasty class imbalance (closed startups, the minority class, were only 6.79% of the sample).
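A quick way to check that imbalance on the cleaned DataFrame, reusing the same placeholder names as in the sketch above:

    import pandas as pd

    df = pd.read_csv("startups_clean.csv")       # placeholder name for the cleaned dataset
    counts = df["closed"].value_counts()         # hypothetical label column: 1 = closed, 0 = operating
    print(counts)
    print("Minority class share: {:.2%}".format(counts.min() / counts.sum()))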
I used ADASYN to solve this problem: it creates artificial data points for the minority class, and I finally got 34,227 startups. The whole process is described in the paper, and I think the code is well documented.
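For reference, this is roughly what the oversampling step looks like with the ADASYN implementation from the imbalanced-learn package (using that package here is my assumption; X and y are the cleaned features and labels from the first sketch, and in a stricter setup you would resample only the training split):

    from collections import Counter
    from imblearn.over_sampling import ADASYN

    print("Before:", Counter(y))                 # heavily skewed towards operating startups
    X_res, y_res = ADASYN(random_state=42).fit_resample(X, y)
    print("After: ", Counter(y_res))             # synthetic minority samples added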

Of course, I used Python for all of this, and nope, not a Jupyter Notebook, but a PyCharm project instead.
And yes! You can access the code on GitHub and download the paper here

The results were interesting, demonstrating how easy it is to access and set up the latest AI technologies. But that wasn't the point, at least not the most interesting one. The model showed a precision of ~94%, which is pretty good: when it said that a startup would be successful, it was right about 94% of the time. The counterpart was a high false positive rate on closed startups, which creates an opportunity cost problem: an investor following the model would miss some great opportunities.
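This is the kind of check I mean, reusing the hypothetical log_reg, X_test and y_test from the sketch above (again, not the exact evaluation code of the thesis):

    from sklearn.metrics import confusion_matrix, precision_score, classification_report

    y_pred = log_reg.predict(X_test)
    print(confusion_matrix(y_test, y_pred))      # rows = actual class, columns = predicted class
    print("Precision:", precision_score(y_test, y_pred))
    print(classification_report(y_test, y_pred)) # per-class precision/recall exposes the trade-off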
Beyond that, the usefulness of the model drops when it comes to time-based predictions. When will startup X skyrocket? Will startup Y follow an upward trend before it goes bankrupt? Those questions are difficult, but so is the most important one: how successful will startup Z be?