Introduction :
A friend of mine started a little Python crypto trading app and shared the sources with me.
But I'm not that comfortable with Python.
See, I'm a Business Intelligence Engineer, meaning my usual tools are ETL/ELT, Databases, (advanced) SQL, and reporting tools. I used to be proficient with Talend Open Studio for Data Integration (ETL), learned Stambia 3 years ago, and now (am obliged to) use a tool developped by my company that relies on pure SQL script. Yeah, that's a major step backward, but hey, I don't get to decide.
I did a bit of machine learning as a hobby, just to keep in touch with the new trends of my field, and plan to put my little knowledge to use.
I tried to build on the existing python script, but I miss so much training in python than even the simplest thing required a lot of research on the web. I gave some thoughts about it and finally decided that I could do it faster using some tools I already know.
Don't get it wrong, I'll learn a lot of new things : I don't know everything about Talend and I'll get out my comfort area. I never really used API or JSON with Talend. So, it's a good way to stay knowledgable about the tool and also to improve my skills further. Same for the other technologies I'll use: the idea is to stay in the warm and comfy place that I enjoy, but continue exploring new stuff from there !
Part 1 : Retrieving the data
Overview
I'll be using Talend Open Studio for Data Integration Talend Open Studio for Data Integration. It's a free and very powerfull ETL (Extract Transform Load) that completely suits a data retrieval process.
I won't get you through the steps of installing and the basics of Talend. Maybe, I'll make a post about it later if you are interested.
Here is my first version of the data retrieval process :
As you can see, it's very simple :
- TFileSymbolList reads a file holding a list of symbols or token pairs (eg. ALPHABUSD )
- tIterateSymbols execute the following steps for each line in the input file
- tJavaExposeSymbol is a trick to feed the REST component with the current symbol
- tRestBinanceCandlestick calls the Binance API and retrieve the klines data
- tMapFormatJson slightly rework the API answer
- tExtractJsonField_1 extract the data from the API anwser and give a row per record and split the columns.
- TFileSaveCandlesticks write a csv file for each symbol
Metrics :
For 51 pairs, the Python script took around 30 seconds.
The Talend job needs the same amount of time.
But here's the trick : I just need a click to enable the Multi thread execution in the job Extra configuration.
Then I just set the number of parallel executions on the Iterate link.
- With 5 threads, it goes down to 11s
- With 10 threads, down to 4.3s
- With 16 threads (1 per logical processor), down to 3s
- With 55 threads, down to 2.65s
As we can see, 10 or 16 threads seems to be a good choice. Further increase in threads does not really improve treatment time.
Configuration - In Depth
For those of you who are interested in the component configuration, here are some details.
tJavaExposeSymbol
A tJavaFlex is a component allowing you to write Java code to handle you flow.
Start Code will be executed BEFORE looping on rows.
Main code will be executed for each row.
End Code will be executed AFTER the loop.
I just need the "Main Code" as I just want to transfer the current symbol to the REST client.
As you can see, the orange line going out of "tJavaExposeSymbol" is named row4. I declared a column named "symbol" in its schema.
On the other hand, tIterateSymbol publishes a global variable, name "symbolList.symbol". The code simply looks like this :
row4.symbol = ((String)globalMap.get("symbolList.symbol"));
tRESTBinanceCandlestick
The component is a tRESTClient.
I need to call "https://api.binance.com/api/v3/klines". So the base URL is "https://api.binance.com/api/v3/" and the Relative Path is "klines". I pass 2 parameters : interval, and symbol.
See Binance API Doc
For Binance, I also need to authenticate with a simple API key, inside the header.
The name of the header field is "X-MBX-APIKEY".
tMapFormatJson
A tMap is used to map field from one or more input source to one or more output source. You can makes SQL-like joins, and many field operations with all the power of Java.
I used it for a very simple mapping : I only retrieve the field "string" (yeah, bad naming!) which holds the API response, and add a field with the current symbol.
Note that I wrapped the response with a curly braces and added a name to the response (myList)
tExtractJSONFields_1
tExtractJSONFields component are used to transform a JSON field into a list of rows and columns (aka "data flattening")
This JSON is very simple : it's just a list of rows containing a list of columns looking like this :
[
[
1499040000000, // Open time
"0.01634790", // Open
"0.80000000", // High
"0.01575800", // Low
"0.01577100", // Close
"148976.11427815", // Volume
1499644799999, // Close time
"2434.19055334", // Quote asset volume
308, // Number of trades
"1756.87402397", // Taker buy base asset volume
"28.46694368", // Taker buy quote asset volume
"17928899.62484339" // Ignore.
]
]
(example from Binance API Doc)
Talend won't accept a field without curly braces. That's why we added them in the tMap, giving something like this :
{
myList :
[
[ row1 ], [ row2 ], [ ... ]
]
}
The component can now be configured :
The Loop element is "myList" so the Json Path query is "$.myList.[*]".
Then each column is posititional : "@[position]"
Notice that "symbol" comes from the previous component, and does not have any "Json Query" parameter.
tFileSaveCandlesticks
This component is a very simple "tFileOutputDelimited". It's your best friend when you need to store data locally and/or have a look at the data. Configuration is straightforward : just give the path for your filename.
You can configure the row separator, the field separator, text enclosure, escape character, encoding, thousand and decimal separators...
In the path we can use the same syntax as in the tMap and tJava to retrieve the symbol.
We will end up with one file per symbol data.
Ending thoughts for part 1
Making this little Talend job took me less than an hour even if I never got to use tExtractJSONFields and tRESTClient and had to do some research. I just spent much more time to write this article 😅.
It is faster to process, and much easier for me to read and understand.
The next things I'll try to implement are :
- A configuration file : API Address and keys, interval, etc..
- Data storage on a database : I wanna try BigQuery with Talend
- Financial Indicators computation : I use finta but it's for Python. I'll have to come up with something :)
Tell me !
What you think of this article, what I did right, what I should do differently, what can be improved.
Could you be interested in ETL/ELT ? Should I write a little guide for newbies ?
And if somebody knows a good ressource about querying the Hive Blockchain, let me know : I'd be happy to try to access it from Talend !