In the two previous posts we have seen how disk IO and network IO affect our ETLs. For both use cases we have seen several techniques that can drastically improve performance and lead to efficient resource usage:
Avoid disk IO altogether.
Use buff/cache properly when disk IO cannot be avoided.
Optimize data downloads by choosing the right file format, using Keep-Alive properly and parallelizing network operations.
In this post we are going to put together network and processing operations to see the improvement in a complete workflow.
In the previous post I focused on avoiding disk IO as much as possible and, when that was not possible, on making the most of buff/cache by grouping IO operations in time. This approach can make our ETL processes run many times faster. In the two examples the numbers were:
Avoiding IO altogether was 11.3 times faster
Using buff/cache was almost 4 times faster
All those examples used a dataset already on disk, so no real network operations took place. In this post I am going to focus on network operations, again using GNU Parallel.
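To give an idea of what that looks like, here is a minimal sketch (not the exact commands from the post) of parallelizing downloads with GNU Parallel and curl; urls.txt and the downloads/ directory are placeholders of my own:

```bash
# Hypothetical example: fetch every URL listed in urls.txt, 8 downloads at a
# time, saving each file under downloads/ with its original basename ({/}).
mkdir -p downloads
parallel -j 8 --bar 'curl -sS --retry 3 -o downloads/{/} {}' :::: urls.txt
```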
Several months ago I was asked to record a short video to spread the word about GNU Parallel. GNU Parallel is a fantastic tool, a Swiss Army knife for process parallelization. With GNU Parallel you can:
Parallelize long, boring pipelines with only a few extra lines of code.
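As an illustration (my own hedged example, not taken from the original post), a single-threaded compression step in a pipeline can be spread over all cores with parallel --pipe:

```bash
# Hypothetical example: compress a large text log using all cores by splitting
# stdin into 10 MB blocks, gzipping each block in parallel and keeping the
# output in order (-k). Concatenated gzip streams are still a valid gzip file.
cat big.log | parallel --pipe --block 10M -k gzip > big.log.gz
```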
This August I got my “Hitachi Vantara Certified Specialist – Pentaho Data Integration Implementation HCE-5920 Exam” certification. The badge can be checked by clicking on the image or on the link.
In 2016 I finished a project for an EU institution to automate the generation of many reports. The data sources were diverse, e.g. APIs, databases, etc.
The reports were generated automatically in 15 minutes, while the previous process took around 3 weeks because the reports were filled in manually and intermittently: 2 hours today, 3 hours tomorrow, 5 hours next week.
During the COVID-19 lockdown I have invested some of the “free time” it gave me in refreshing some old topics like capacity planning and command line optimization.
Recently I changed my backup solution from SpiderOak to Tresorit. I had been very happy with SpiderOak since I started with them around 2009, but last year backups and sync started to fail, e.g. backups taking ages or not finishing at all. Support response times were not good enough either, and they did not find a proper fix for my problems, so I finally decided to move my business elsewhere. The chosen one was Tresorit, a Swiss-based company that offered two things important to me: de-duplication and client-side encryption.
Both solutions work on Linux, but Tresorit needs a GUI to work (SpiderOak supports a headless mode). This was a problem as I wanted to run the Tresorit client on headless VPS servers. To add a kind of pseudo-headless support to the Tresorit client I decided to use Xpra, a multi-platform (Microsoft Windows, Linux, Mac) screen and application forwarding system, or as they say on their web page, “screen for X11”. Continue reading →
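In rough terms the approach looks like the sketch below; the client binary name and the exact attach syntax are my assumptions and vary between Xpra versions, so see the full post for the real setup:

```bash
# On the headless VPS: start a virtual display :100 and launch the Tresorit
# client inside it ("tresorit" stands in for the real client binary path).
xpra start :100 --start-child=tresorit

# From a desktop machine: attach to that session over SSH when the GUI is
# needed, then detach and leave the client running on the server.
xpra attach ssh:user@vps-server:100
```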
During the last few weeks I have been interviewed for several DevOps positions. In two of them I had to reply to a skills check-list, and in the other one I had to solve an exercise and send it back by email. I think these check-list interviews are not good for DevOps positions, especially if the check-lists are not kept up to date. Let’s see why…
Sometimes you have to deal with servers that you don’t know anything about:
You are a short-term IT consultant with no previous knowledge of the environment.
The CMDB is out of order.
You are in a DR (disaster recovery) situation.
Or simply the main administrator is not there.
And you need to:
Run commands in parallel
Get info from many servers at a time
Troubleshoot DNS problems
Check how many servers are up and running
On my systems I use two orchestrators, MCollective and SaltStack (both configured automatically with Puppet), that fulfil my needs. But let’s see a quick way to get an ad-hoc orchestrator up and running in a hurry, as in the sketch below.
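As a hedged sketch of the idea (not the exact commands from the post), GNU Parallel plus plain SSH already covers most of those needs; servers.txt is a placeholder file with one hostname per line:

```bash
# Run a command on every server, 20 hosts at a time, tagging each output
# line with the host it came from. BatchMode avoids password prompts.
parallel -j 20 --tag 'ssh -o BatchMode=yes -o ConnectTimeout=5 {} uptime' :::: servers.txt

# Quick "who is alive" check: one ping per host, printing OK or KO per server.
parallel -j 50 --tag 'ping -c1 -W1 {} >/dev/null && echo OK || echo KO' :::: servers.txt
```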
I have been working with DigitalOcean for several months; on average DigitalOcean deploys your VPS server in 55 seconds. After the server is deployed, though, all the manual, error-prone and boring configuration work still has to be done.
As I am using Puppet to configure all my servers, I have created the provisioningDO rakefile script (based on John Arundel’s book Puppet 3 Cookbook) to deploy and configure my servers in 4 min 15 sec. That means that after 4 min 15 sec my servers are ready for production.
It also installs and configures knockd (port-knocking software). Continue reading →
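The provisioningDO rakefile itself is covered in the full post; just to illustrate the general deploy-then-configure idea, here is a hedged sketch using DigitalOcean’s doctl CLI instead (droplet name, image, size, region and SSH key ID are placeholders, and a Puppet master is assumed to exist already):

```bash
# Create a droplet and wait until it is active.
doctl compute droplet create web01 \
  --image ubuntu-22-04-x64 --size s-1vcpu-1gb --region fra1 \
  --ssh-keys 12345678 --wait

# Fetch its public IP and hand the box over to Puppet for configuration.
IP=$(doctl compute droplet get web01 --format PublicIPv4 --no-header)
ssh root@"$IP" 'apt-get update && apt-get install -y puppet && puppet agent --test'
```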