Web Scraping

by Nicolas Sacchetti

Web scraping is used to harvest information from websites. The collected data can be used to anticipate a company's future financial difficulties. Josep Domenech is an associate professor in the Department of Economics and Social Sciences at the Polytechnic University of Valencia. As a researcher, Prof Domenech is interested in web scraping and the digital economy. He presented his web scraping methodology.

The event took place via videoconference at the P4IE Congress on Policies, Processes and Practices for Performance of Innovation Ecosystem presented by 4POINT0 from May 11 to 13, 2021. 

SMEs are the backbone of the European Union (EU): they represent 99% of the continent's businesses, according to the European Commission and its 2008 Small Business Act. In their early years, however, when access to credit is limited, they need subsidies to remain competitive.

To get around the time it takes for surveys to establish the economic health of an SME, the 2022 study Predicting SME's Default: Are Their Websites Informative?, by Josep Domenech and his colleagues Lisa Crosato and Caterina Liberati, proposes a new approach: web scraping, that is, using data harvested from the firms' own websites. This publicly available online information allows near real-time monitoring of SMEs by automatically detecting information relevant to economic analysis.

Prof Domenech's initial intuition was that companies' websites constitute a wealth of information relevant to predicting their economic value: "The results showed that changes in website content clearly reflect the status of the company. Active SMEs were mainly associated with updated websites, while inactive ones were more associated with closed ones. In fact, the results confirmed that the risk of business closure increases when website activity decreases."

Mathematics

Multi-period logistic regressions and statistical survival analyses were applied to study business activity status and to understand how website status relates to business survival. The classification also relies on Fisher's Kernel Discriminant Analysis (KDA) and other mathematical tools.
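To make the setup concrete, here is a minimal sketch of a discrete-time (multi-period) logistic regression on firm-period data, written in Python with scikit-learn. The variable names and figures are hypothetical placeholders, not the study's actual dataset.

```python
# Minimal sketch of a multi-period (discrete-time) logistic hazard model.
# Column names and values are hypothetical; the study's actual variables differ.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Each row is one firm observed in one period (firm-period format),
# with offline financials and online (website) indicators as predictors.
data = pd.DataFrame({
    "employees":       [12, 12, 45, 45, 8, 8],
    "debt_ratio":      [0.40, 0.50, 0.30, 0.35, 0.70, 0.80],
    "website_updated": [1, 1, 1, 0, 0, 0],   # 1 = site changed since last period
    "website_online":  [1, 1, 1, 1, 1, 0],   # 1 = server still responding
    "defaulted":       [0, 0, 0, 0, 0, 1],   # event indicator for the period
})

X = data.drop(columns="defaulted")
y = data["defaulted"]

# Logistic regression estimates the per-period hazard of default
# conditional on the firm's offline and website covariates.
model = LogisticRegression(max_iter=1000).fit(X, y)
print(dict(zip(X.columns, model.coef_[0])))
```

In this format, every firm contributes one row per observation period, so the fitted coefficients can be read as effects on the per-period risk of default.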

Of course, a website with low activity that relies on outdated technology (e.g. Adobe Flash Player) worsens the model's assessment of the firm: "This is interpreted as the company is getting weaker," explains Prof Domenech.

The information presented on the website is extracted and classified into three levels to make the prediction: the content of the site, the HTML code, and the reaction of the server. The machine learning approach then allows the classifications to be cross-validated using a training set and a test set.
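As an illustration, the following Python sketch collects the three levels with the requests and BeautifulSoup libraries; the URL is a placeholder, and a production crawler would add politeness rules (robots.txt, rate limiting).

```python
# Rough sketch of collecting the three levels of website information:
# rendered text content, raw HTML code, and the server's reaction.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"          # hypothetical SME website
response = requests.get(url, timeout=10)

server_level = {                      # 3) server reaction
    "status_code": response.status_code,
    "headers": dict(response.headers),
}
html_level = response.text            # 2) HTML code (tags, technologies used)
soup = BeautifulSoup(html_level, "html.parser")
content_level = soup.get_text(separator=" ", strip=True)  # 1) visible content

print(server_level["status_code"], len(html_level), len(content_level))
```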

Variables

In the study by Domenech et al., the variables used to predict firm failure are as follows. The offline variables selected as predictors, in line with classical theory, are the number of employees, the firm's years of activity, the percentage of debt, productivity, and economic profit.

For the online variables, websites were accessed through the Internet Archive's Wayback Machine, which is a digital library of websites capable of showing the appearance and characteristics of a site and its changes over time. This shows whether the site was offline for a while, and the extent of the changes that have taken place over time.
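For example, the Wayback Machine exposes a public availability endpoint that returns the archived snapshot closest to a requested date. The sketch below queries it for a placeholder domain; it is an illustration of the data source, not the authors' actual retrieval pipeline.

```python
# Sketch of querying the Internet Archive's Wayback Machine availability API
# for the snapshot of a site closest to a given date. The target URL is a placeholder.
import requests

def closest_snapshot(url: str, timestamp: str) -> dict:
    """Return metadata of the archived snapshot closest to `timestamp` (YYYYMMDD)."""
    api = "https://archive.org/wayback/available"
    resp = requests.get(api, params={"url": url, "timestamp": timestamp}, timeout=10)
    return resp.json().get("archived_snapshots", {}).get("closest", {})

snap = closest_snapshot("example.com", "20190101")
if snap:
    print(snap["timestamp"], snap["url"])   # when and where the archived copy lives
else:
    print("No archived snapshot found")
```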

Training Data

To train the model that the AI relies on, you first need to gather data. It is best to have a balanced sample of surviving companies and defaulting ones, to prevent computational bias, that is, a lack of neutrality. To deal with sampling imbalances, Prof Domenech proposes five techniques:

Oversampling adds more data points for the defaulting firms in an effort to rebalance the sample: default observations are simply repeated.

SMOTE (Synthetic Minority Oversampling Technique) generates synthetic data points by interpolating between each minority-class sample and its nearest neighbours in that class.
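A brief sketch of both resampling options above, using the third-party imbalanced-learn package on synthetic data; the class ratio is an arbitrary stand-in for the real share of defaulting firms.

```python
# Sketch comparing plain random oversampling with SMOTE on an imbalanced sample,
# using the imbalanced-learn package and synthetic data.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Synthetic stand-in for a firm sample where defaults (class 1) are rare.
X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# Random oversampling: repeat existing default observations.
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_ros))

# SMOTE: interpolate new synthetic defaults between minority neighbours.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:", Counter(y_sm))
```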

"When there are too many fluctuations about a few variables, say the number of block headers on one website is 1, and on another it's 1 000, then what to do with that much variability," asks Domenech. He informs that converting website characteristics into binary variables tempers this situation.

On the other hand, when there are too many variables, the statistical regression method LASSO (Least Absolute Shrinkage and Selection Operator) "performs both variable selection and regularization to reduce generalization errors." 
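One common way to apply this idea to default prediction is an L1-penalised logistic regression, as sketched below on synthetic data; the penalty strength C is an arbitrary choice, not a value from the study.

```python
# Sketch of LASSO-style variable selection for default prediction: an
# L1-penalised logistic regression shrinks uninformative coefficients to zero.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)
X = StandardScaler().fit_transform(X)   # L1 penalties need comparable scales

lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_logit.fit(X, y)

kept = (lasso_logit.coef_[0] != 0).sum()
print(f"{kept} of {X.shape[1]} variables kept by the L1 penalty")
```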

Dimensionality Reduction (DR) is used when the data to be analyzed or organized has too many dimensions. DR can be achieved through Multiple Correspondence Analysis (MCA), "an extension of correspondence analysis that allows the analysis of the pattern of relationships of several categorical dependent variables." Kernel Fisher Discriminant Analysis (KDA), a nonlinear classification method, can also be used for the reduction.
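The sketch below covers the MCA route only, computed from first principles in numpy as correspondence analysis applied to the one-hot indicator matrix; the categorical website features are invented for illustration.

```python
# Minimal numpy sketch of Multiple Correspondence Analysis (MCA): correspondence
# analysis applied to the indicator (one-hot) matrix of categorical variables.
# Data and category names are hypothetical.
import numpy as np
import pandas as pd

cats = pd.DataFrame({
    "has_ecommerce": ["yes", "no", "no", "yes", "yes"],
    "technology":    ["modern", "flash", "flash", "modern", "modern"],
    "update_freq":   ["high", "low", "low", "high", "low"],
})

Z = pd.get_dummies(cats).to_numpy(dtype=float)   # indicator matrix
P = Z / Z.sum()                                  # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)              # row and column masses
S = np.diag(r**-0.5) @ (P - np.outer(r, c)) @ np.diag(c**-0.5)
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Row (firm) coordinates on the first two principal dimensions:
row_coords = (np.diag(r**-0.5) @ U * sigma)[:, :2]
print(row_coords)
```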

Josep Domenech also invites you to question whether scraping an entire site is necessary. In some cases, the home page is sufficient. The prediction can also be made on different time scales.

Test Data

The approach is evaluated by repeating the training and testing procedure multiple times (k-fold cross-validation). "The indicators are based on the criteria of sampling balance accuracy, oversampling sensitivity, and SMOTE specificity," Domenech explains. Banks are most interested in the sensitivity criterion, as they want to know what proportion of company failures is detected.
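As an illustration of this evaluation loop, the following sketch runs a stratified k-fold cross-validation on synthetic data and reports average sensitivity and specificity; the classifier and class balance are placeholders rather than the study's configuration.

```python
# Sketch of evaluating a default classifier with k-fold cross-validation,
# reporting sensitivity (share of defaults detected) and specificity.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression(max_iter=1000)

sens, spec = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    model.fit(X[train_idx], y[train_idx])
    tn, fp, fn, tp = confusion_matrix(y[test_idx],
                                      model.predict(X[test_idx])).ravel()
    sens.append(tp / (tp + fn))   # share of actual defaults detected
    spec.append(tn / (tn + fp))   # share of surviving firms correctly cleared

print(f"sensitivity={sum(sens)/len(sens):.2f}, specificity={sum(spec)/len(spec):.2f}")
```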

Finally, Josep Domenech concludes with these words: "Websites are a rich source of information that has yet to be exploited. Specific techniques must be used to transform them into usable data sources."

Related articles

Pierre-Samuel Dubé of Irosoft Legal presents: The Efficiency of AI Assisting the Legal Field
Benjamin Zweig from Revelio Labs explains the creation of: The Jobs Taxonomy
Jean-François Connolly of IVADO presents: Tools to Undertake Data Analysis
