It seems that these days everyone wants to be a data scientist – or employ a team of them. There is little doubt that rigorous analysis of big data can pay dividends in the financial world, but the age-old problems of data quality and false correlations are not going away, as Silvia Pavoni reports.

Rise of the data scientist

Data science has become a buzz phrase in investment circles, and ‘data scientist’ a sought-after job title. Funds are investing big in the profession. The ability to manipulate vast pools of information accruing in real time, in ways unheard of just a couple of decades ago, has grown in kudos: it has been called a revolution, the ‘new oil’, something that can reveal new depths of understanding and unlock alpha, the market-beating returns that funds seek for their clients.

Furthermore, attempts to extract value from new kinds of data, such as Twitter feeds, individuals’ web searches or satellite images of car parks, crops or ships – a world away from the standardised and well-understood sources of financial information – are making data science look more like alchemy.

Replacing humans

Data science is a clear match for quantitative funds, where investment decisions are taken by numerical methods. The now seemingly out-of-favour active funds – which select smaller numbers of stocks mostly through human judgement – are also recruiting data experts to aid their investment process. In some cases, funds are replacing this process altogether. News earlier this year that the world’s largest asset manager, BlackRock, had got rid of several prominent stockpicking managers to switch their funds to quantitative strategies made clear just how crucial big data and data science are to the industry.

One experienced investor, however, finds this trend exasperating. “[Data science] is a new tool, but so was Excel,” he says. “So was the HP calculator; sending email was a huge new tool. This happens all the time.”

There is also a sense that the buzz is intensified by excessively generous definitions. Insiders feel that some in the industry may be more concerned with the terminology used to court clients than with the actual deployment of big data. “Lots of firms like the idea they’re using big data, but I’d argue that for some it’s more marketing than actual usage of [it],” says Ryan Stever, head of quantitative global macro research at Acadian, a US fund with $77bn under management.

“For a fundamental quantitative manager such as ourselves, I think that there’s promise here. But I think a lot of firms may be using lots of data mixed with human judgement and I think there are still a lot of managers doing quantitative investing successfully where the core of their process is based on more classical data, [such as that from] Thomson Reuters,” he adds.

Barry Hurewitz, chief operating officer of UBS’s investment research division, adds: “The term big data is way overused and in some ways not the point. Yes, there are a lot of new and interesting sources of data, but they are largely available to everyone. The edge will come from the creativity in how you analyse the data to draw relevant investment conclusions not obvious to others and [not] priced into the security. It’s about the analysis, not the data per se.” 

Apply the filters

Data scientists might not entirely replace analysts and stockpickers, but their work is undoubtedly transformative as they apply rigour to big data analysis. Indeed, the rarer the data, the sharper the edge. Acadian’s partnership with Microsoft, for example, goes in this direction: turning masses of chaotic web search information into investment ideas starts with methods to clean, filter and analyse such data. When someone searches on ‘apple’, how do investors know if that is the technology company or the fruit? And when there is a spike in job searches in a specific industry, how can they tell if this is because the sector is booming or because employees are being laid off?

If, as Mr Hurewitz notes, most data is available to everyone, the challenge is finding true insights. Curiously, buying and selling proprietary data has changed what constitutes insider trading, because trading on a company’s data sold by a third party does not break the rules, while trading on the same data sold by an employee of such a company does. The widely reported case of Quanton Data is indicative. The research firm, now part of Guidepoint, came across a mobile advertising company that also collected data on the type of device someone was using when the ad was being displayed, a lucrative find that helped it to estimate iPhone sales ahead of Apple’s 2011 and 2012 announcements.

Mostly, convincing businesses to sell valuable information they hold as part of their operations is a struggle. It is a new concept for most companies, who fear that by selling it they could be breaking market rules or unintentionally divulging proprietary information regarding their own activities, or who simply may not have considered it as a revenue stream and therefore have no one within the organisation who can deal with the request.

Looking to buy

All of this is a challenge for both investment firms and the data providers scouting for new sources and pre-packaging information sets for them. “Absolutely the hardest thing about being in this business is getting the data,” says Tammer Kamel, CEO and founder of data provider Quandl, which has recently added an astrophysicist to its team. “Selling the data is easy: if you have interesting data there is no shortage of hedge funds that will pay for it. [But] companies treat their data as a precious proprietary asset, and there has to be willingness at management level to monetise it. It’s still a modern concept.”

This is what Mr Kamel has been focusing on. Over the past year, he has managed to increase the number of alternative datasets on Quandl’s platform from two to 30. “We devote most of our energy to bringing in data from the wilderness, outside the normal boundaries of Wall Street,” he says.

Fascination with wild data is mirrored by the often misunderstood work of statisticians dissecting it. The term ‘data scientist’ was coined in 2008, according to Harvard Business Review, and its rise was propelled by the explosion of big data. Generally, data is considered big when typical database software tools are unable to capture, store, manage and analyse it. Practically, it means the 1 million transactions Wal-Mart records every hour, or billions of consumer web clicks. As computer power grows, so does the size of the datasets that are considered ‘big’. And as automatic capture abilities develop, so does the importance of big data and of the scientists able to extract value from it.

David Hand, emeritus professor of mathematics and former chair in statistics at Imperial College London, recalls an old joke reducing statisticians’ work to mostly cleaning data. This still holds true, he says, but the difference is the importance placed on that activity by the investment industry (among others), particularly as datasets continue to expand and alternative sources are explored.

“Because of the publicity attracted by data science, people have recognised that the ability to extract valuable data from the raw data, with all its potential issues, is a very valuable skill,” he says.

Indeed, Mr Hand also serves as chief scientific adviser to Winton Capital Management, a London-based quantitative fund with assets under management of $30bn, where data quality, he points out, is taken very seriously. “At Winton we have a lot of people whose task it is [to clean data]. It has a very good reputation, it’s a recognised career path,” he says. “The bottom line is that one is seeking only statistical relationships, correlations. If one gets one’s predictions right just 1% more than chance, then one can make good use of them in investment decisions; [but] it’s easy to be misled by poor-quality data or distorted samples when correlations are weak.”
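Mr Hand’s 1% figure can be put in rough numbers. The sketch below is an illustration rather than anything Winton is described as doing: it assumes every prediction is an independent up-or-down call, and it asks how many such calls are needed before a given hit rate stands about two standard errors clear of coin-flipping. The function name and the 1.96 threshold are illustrative choices.

```python
import math

def calls_needed(hit_rate: float, z: float = 1.96) -> int:
    """Independent calls needed before an edge over 50% exceeds z standard errors.

    Assumes each call is an independent, coin-like bet (a simplification).
    """
    edge = hit_rate - 0.5
    if edge <= 0:
        raise ValueError("hit_rate must exceed 0.5")
    # Under pure chance the standard error of an observed hit rate is 0.5 / sqrt(n).
    # Solve edge = z * 0.5 / sqrt(n) for n.
    return math.ceil((z * 0.5 / edge) ** 2)

print(calls_needed(0.51))  # edge of one percentage point
print(calls_needed(0.55))  # edge of five percentage points
```

Under these assumptions a one-point edge needs on the order of 10,000 clean observations to show up reliably, which is why a modest amount of bad or distorted data can swamp it.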

Fame and caution

Data scientists approach the field – and their new-found fame – with excitement as much as with caution. Their task requires a broad skill set including information theory, computer science, signal processing, statistics and mathematics. More than anyone, they are familiar with the perils of alternative data such as unstructured text, images or audio files. Bringing in specialists from natural sciences, such as physics or biology, highlights the contrast with the social sciences of finance and economics, where hypotheses can lead to positive outcomes without necessarily being valid or replicable in the future.

“The interesting thing about finance, which we find every day, is the fact that it’s easy to be accidentally right,” says Jean-Philippe Bouchaud, chairman of Capital Fund Management, a leading French quantitative fund. “This creates a very strong bias in everything we [as an industry] try to do, gauging the quality of a new investment idea.” Mr Bouchaud created the fund’s research division more than two decades ago, and is also professor of physics at the École Polytechnique. He is particularly vigilant when tackling new sources of information. “When the data is not stationary, when things evolve all the time, for us it is difficult to distinguish between a gamble and science,” he says.

Alternative data is challenging the rigorous application of scientific methods as it often lacks standard values as well as historical values. In the case of satellite images, say of a shopping centre’s car park that an investor wants to monitor to predict its success, the image resolution may not be good enough to tell whether the cars parked in each slot have changed during the day. Social media data is highly unstable, with comments broadcast at changing frequencies and of the most disparate nature and tone, making it hard to classify. It is also very noisy, in statistical terms, meaning that finding and validating a signal is hard.

Quality control

Filtering techniques have long been applied in finance to reduce such noise, according to Damiano Brigo, chair of mathematical finance and co-head of the mathematical finance research group and stochastic analysis group at Imperial College. These techniques were already used in the first moon landing in 1969, to filter out noise and errors from combined radar readings of space modules. Their application to alternative data has led to mixed conclusions.

“Remember the debate on trading signals in Twitter feeds a couple of years ago? The discussion focused on how noisy the feed was, and on how often Twitter followed the signal, rather than anticipating it,” says Mr Brigo. “Results and techniques in this domain are hard to know given the obvious commercial sensitivity of this research.”
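The article does not name a specific filtering technique. As one well-known example of the family, the sketch below applies a scalar Kalman filter to simulated noisy readings of a fixed level. The data, the noise variances and the function names are all assumptions made for illustration.

```python
import random

def kalman_1d(readings, meas_var, process_var=1e-4):
    """Scalar Kalman filter for a slowly drifting level observed with noise."""
    estimate, variance = readings[0], meas_var   # start from the first reading
    out = [estimate]
    for z in readings[1:]:
        variance += process_var                  # predict: uncertainty grows a little
        gain = variance / (variance + meas_var)  # weight given to the new reading
        estimate += gain * (z - estimate)        # update: move towards the reading
        variance *= (1.0 - gain)                 # uncertainty shrinks after the update
        out.append(estimate)
    return out

def mean_abs_error(values, truth):
    return sum(abs(v - truth) for v in values) / len(values)

rng = random.Random(0)
true_level = 10.0
noisy = [true_level + rng.gauss(0.0, 1.0) for _ in range(500)]
smoothed = kalman_1d(noisy, meas_var=1.0)

print("raw error:     ", round(mean_abs_error(noisy, true_level), 3))
print("filtered error:", round(mean_abs_error(smoothed, true_level), 3))
```

The filter works here because the simulated signal is stable and the noise is well behaved. Neither can be taken for granted with a social media feed.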

But bad investment calls can simply come from bad data, as Mr Hand defines it, which can be the result of human error or automated systems. It encompasses anything from the trivial, such as numbers written down incorrectly, missing digits or asynchronous time series that are not matched properly, to selection bias when using social media sources, because reactions and views voiced there can hardly be grouped in representative samples of the population. “Data quality is absolutely fundamental. I can give you countless examples of misleading conclusions having been drawn because of some aspects of errors in the data,” says Mr Hand.

Missing links?

Inaccurate conclusions can also result from spurious correlation. When data sets are so vast, it is almost impossible to avoid the discovery of some sort of correlation between two factors, but whether this constitutes an investment signal is another matter. Adapting a model and making it unnecessarily complex to explain abnormalities in the data, or ‘overfitting’, is also a risk.

“The more data you have, the bigger the chance to find a relationship between past data and future behaviour,” says Mr Bouchaud. “Of course, having more data is extremely exciting, because it allows us to understand better how the world is functioning, how markets are functioning. But by the same token, having more data increases tremendously this risk of overfitting and inventing relationships that held in the past but have no reason to continue working.”

Mr Stever adds: “With the explosion of big data, correlation is pretty easy to find: if you’re not finding a correlation you’re not looking hard enough. The question is: does it repeat itself in the future? That’s what we care about in investing. And to find that, we really need to rely on investment theses, why something would correlate with returns.”
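The point made by Mr Bouchaud and Mr Stever can be reproduced with pure noise. In the sketch below, which is illustrative and uses invented data throughout, 1,000 random series are screened against a random ‘returns’ series over a past period. The series that fits best is then checked against a later period. None of the series has any real link to the returns.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

rng = random.Random(42)
n_past, n_future, n_candidates = 50, 50, 1000

# Everything here is independent random noise: no series truly predicts returns.
returns = [rng.gauss(0, 1) for _ in range(n_past + n_future)]
candidates = [[rng.gauss(0, 1) for _ in range(n_past + n_future)]
              for _ in range(n_candidates)]

def corr_past(series):
    return pearson(series[:n_past], returns[:n_past])

# Pick the candidate that looks best on past data alone.
best = max(candidates, key=lambda s: abs(corr_past(s)))
in_sample = corr_past(best)
out_of_sample = pearson(best[n_past:], returns[n_past:])

print(f"best in-sample correlation: {in_sample:+.2f}")
print(f"same series, later period:  {out_of_sample:+.2f}")
```

Searching enough candidates all but guarantees an impressive fit on past data. Because the fit comes from the search rather than from any real link, it should not be expected to persist in the later period.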

Examples of inexplicable relations between data points are as copious as they are colourful; many do not need a financial professional to be identified as lacking value. For example, Mr Hand points out that before a hurricane, sales of strawberry Pop-Tarts increase sevenfold, and that, according to the tylervigen.com blog on spurious correlations, the divorce rate in Maine closely tracks the US’s per-capita consumption of margarine.

The bizarre risks of analysing big data reinforce Mr Hurewitz’s defence of human judgement in investment decisions. He believes that active investors’ concerns drive securities prices and that data models based on past assumptions would quickly become out of tune if circumstances change. “Investor debates can shift and data can become less relevant or irrelevant to what’s moving [price] over time,” he says. “If you’re thinking about investing in a company, and all of a sudden you have an issue such as Brexit or Donald Trump or a new entrant comes into the market, a merger is announced, now the stock price is moving because of a different uncertainty than before. So the model that you’ve built, that is trying to offer signal, may not be what the market is focused on.”

Real-time results

Fund managers point to a further challenge, which is the ability to measure the direct impact of data science on returns, and therefore prove its worth. These same managers are secretive about the details of their research methods and sources, or even the size of their data science teams.

Despite the relative infancy of the field, there are some interesting links already being created at the intersection of big data and its real-time capture and statistical analysis. Data from insurance companies, for example, translates into instant insights into the auto market because each car insurance contract reflects a car purchase. Or weather app data can reveal precious information on consumer patterns. To supply climate information, the app needs the exact location of its user, which therefore reveals their proximity to stores and helps measure foot traffic in shopping malls.

Quandl is finalising an agreement with one such provider. “This company has 18 billion requests a day for the weather,” says Mr Kamel. “This means it knows where many hundreds of people are many times a day, which means you can track the physical location of these people. That’s valuable for knowing consumer trends. How many people went to Wal-Mart today, how many people went to Best Buy?”

Even the disorderly nature of social media may in time improve as the history of data sets grows in line with the chances of understanding how the mass of unruly information moves and if it holds investment signals. The more data becomes available, stored and analysed, the sharper investment ideas will likely become. Further, investment assumptions will become more accurate. The same is true for fields such as economics. The Massachusetts Institute of Technology’s Billion Prices Project, for example, trawls the internet to create daily inflation indices based on online prices and to conduct research. The benefits of this are self-evident.

“All this data is helping economics and finance to become sciences in the sense that physics is a science, because you can test hypotheses, you can understand the mechanism by which something influences something else,” says Mr Bouchaud. “Economics will probably change nature because of this in the next 10 to 20 years. That’s very exciting for investors but also for society.” From an often overlooked bunch to investment trend-setters, data scientists might, in the future, be able to pin down the laws of economics and finance. This could be their biggest achievement of all.

The Banker is a service from the Financial Times.