XClip

December 20th, 2009

I know I may be late with this one, but I recently found this utility and I think it’s pretty neat.
It reads from standard input, or from one or more files, and makes the data available as an X selection for pasting into X applications.

http://sourceforge.net/projects/xclip/

$ xclip < /path/to/file
[middle click]

Tuning 101

December 17th, 2009

Interrupts:

In software, an interrupt is an event that calls for a change in execution.
Interrupts are serviced by a set of processors. By adjusting the affinity setting of an interrupt we can determine on which processor the interrupt will run.

Threads:

Threads provide programs with the ability to run two or more tasks simultaneously.
Threads, like interrupts, can be manipulated through the affinity setting, which determines on which processor the thread will run.
It is also possible to set scheduling priority and scheduling policies to further control threads.

By moving interrupts and threads on and off processors, you indirectly control what each processor does. This gives you greater control over scheduling and priorities and, in turn, over latency and determinism.
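As a rough illustration of the affinity setting described above: on Linux, both interrupt and thread affinity are commonly expressed as a hexadecimal CPU bitmask in which bit 0 stands for CPU 0. The helper below builds such a mask. The function name cpu_mask is made up for this sketch, and IRQ 24 and PID 1234 in the commented-out commands are placeholders rather than values from a real system.

```shell
#!/bin/sh
# Build the hexadecimal CPU bitmask used by /proc/irq/<N>/smp_affinity
# and accepted by taskset, from a list of CPU numbers (CPU 0 is bit 0).
# cpu_mask is a hypothetical helper name used only in this sketch.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 0 2    # mask covering CPUs 0 and 2
cpu_mask 3      # mask covering CPU 3 only

# Applying a mask needs root and real IRQ/PID numbers. IRQ 24 and
# PID 1234 are placeholders, so these commands stay commented out.
#   echo "$(cpu_mask 3)" > /proc/irq/24/smp_affinity   # steer IRQ 24 to CPU 3
#   taskset -p "$(cpu_mask 3)" 1234                    # pin PID 1234 to CPU 3
#   chrt -f -p 80 1234                                 # SCHED_FIFO, priority 80
```

The last commented line shows the scheduling policy and priority side of the same idea: chrt changes how a thread is scheduled, while taskset changes where it runs.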

Marketdata Stats site Powered by Exegy!

November 24th, 2009

It is possible to monitor the overall message traffic in the US and Canada on the website http://www.marketdatapeaks.com. This web site is powered by a single Exegy TickerPlant appliance in a New York Savvis data center. At the time of this writing, the peak was 1.7 million messages per second, received (and successfully processed) on 6th November 2009.

Data Center Top-of-Rack Architecture Design

November 16th, 2009

This document examines the use of the top-of-rack (ToR) cabling and switching model for next-generation data center infrastructure. It explores current 10 Gigabit Ethernet cabling choices and provides a solution architecture based on ToR to address architectural challenges. Data center managers and facilities administrators will choose cabling architectures based on various factors. The ToR model offers a clear access-layer migration path to an optimized high-bandwidth network and cabling facilities architecture that features low capital and operating expenses and supports a “rack-and-roll” computer deployment model that increases business agility. The data center’s access layer, or equipment distribution area (EDA), presents the biggest challenge to managers as they choose a cabling architecture to support data center computer connectivity needs. The ToR network architecture and cabling model proposes the use of fiber as the backbone cabling to the rack, with copper and fiber media for server connectivity at the rack level.

 

 

PDF can be found here:
ToR Architecture White Paper

Alternative TCP stack for Linux

November 16th, 2009


I’m doing some independent research on various TCP stacks available for Linux and I came across these guys at http://www.fastsoft.com/products/

 

From some of the more technical white papers available on their site, this stack seems to be loosely based on the Vegas implementation but highly optimized. As we look at connecting hosts via 10GbE this looks very attractive. They seem to have a working stack for Linux, delivered as a loadable kernel module (LKM),

i.e.

# sysctl -w net.ipv4.tcp_congestion_control=ftcp
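Before switching algorithms it helps to check what the running kernel actually offers. Linux exposes two related keys: net.ipv4.tcp_available_congestion_control is a read-only list of loaded algorithms, and net.ipv4.tcp_congestion_control holds the one currently in use. The sketch below checks a name against that list. The function name cc_available is made up here, and the optional second argument exists only so the check can be tried without touching /proc.

```shell
#!/bin/sh
# Return success if the named congestion control algorithm appears in
# the list of algorithms the kernel reports as available.
# cc_available is a hypothetical helper name used only in this sketch.
#   $1 = algorithm name
#   $2 = optional space-separated list (defaults to the kernel's own list)
cc_available() {
    algo=$1
    list=${2:-$(cat /proc/sys/net/ipv4/tcp_available_congestion_control 2>/dev/null)}
    case " $list " in
        *" $algo "*) return 0 ;;
        *)           return 1 ;;
    esac
}

if cc_available cubic; then
    echo "cubic is available on this kernel"
else
    echo "cubic is not listed on this kernel"
fi
```

If the name is listed, it can then be selected by writing it to net.ipv4.tcp_congestion_control as root.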

FAST TCP is an alternative congestion control algorithm in TCP.

It is designed for high speed data transfers over large distance, e.g., tens of gigabyte files across the Atlantic.

 

Our current implementation is in TCP on Linux platform, though the principles and design can be implemented in other contexts than TCP.

See more information here: http://netlab.caltech.edu/FAST/

What Is a Shielded CPU?

November 2nd, 2009

A shielded CPU is dedicated to running a high-priority task and the interrupt(s) associated with that task. To create a shielded CPU, the operating system must provide the ability to set a CPU affinity for both processes and interrupts.
The 2.4 series of Linux has the ability to set CPU affinity for interrupts, and open-source patches are available that provide this capability for processes (see “Kernel Korner: CPU Affinity”, LJ, July 2003).
Because a shielded CPU does not run background tasks, a high-priority task on a shielded CPU never is prevented from responding to an interrupt because another task currently is executing inside of a critical section on that CPU.
Interrupts always execute at a priority higher than any task, and because they occur at unpredictable points in time, non-real-time interrupts can cause significant non-determinism in a process’ predicted execution time. A shielded CPU is not permitted to run interrupts unless the interrupt is one that a high-priority task on the shielded CPU is using.

The key benefit of the shielded CPU approach is that it allows a commodity operating system to be used for applications that have hard real-time deadlines. Commodity operating systems like UNIX or Linux provide a benefit for these applications because they have large numbers of programmers that are familiar with the programming API, and there is a rich set of development tools and other application software available for these operating systems.

Shielded CPUs can provide a more deterministic execution environment because the overhead of the operating system is essentially offloaded onto the other CPUs in the system.
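One way to approximate shielding with stock tools is to steer every interrupt away from the chosen CPU and then pin the high-priority task onto it. The sketch below only computes the mask of all CPUs except the shielded one. The function name unshielded_mask, the 4-CPU machine, the choice of CPU 3 and the program name ./my_task are all assumptions for illustration.

```shell
#!/bin/sh
# Hexadecimal mask of every CPU except the shielded one, suitable for
# steering interrupts away from it.
# unshielded_mask is a hypothetical helper name used only in this sketch.
#   $1 = total number of CPUs, $2 = CPU number to shield
unshielded_mask() {
    ncpus=$1
    shielded=$2
    all=$(( (1 << ncpus) - 1 ))
    printf '%x\n' $(( all & ~(1 << shielded) ))
}

unshielded_mask 4 3    # 4 CPUs, shield CPU 3: mask covers CPUs 0-2

# Applying it needs root, so the commands stay commented out.
# Some IRQs cannot be moved; errors for those are ignored.
#   mask=$(unshielded_mask 4 3)
#   for f in /proc/irq/*/smp_affinity; do
#       echo "$mask" > "$f" 2>/dev/null
#   done
#   taskset -c 3 ./my_task    # ./my_task is a placeholder program
```

An interrupt that the shielded task itself depends on would be pointed back at the shielded CPU afterwards, in line with the definition above.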

mdraid and the 200000k speed limit

October 31st, 2009

By default md-raid will limit its operations to 200000k/sec, which is plenty for most desktops and 2 - 3 disk machines. But when you have more than 3 - 4 disks and there is enough CPU and I/O bandwidth available, it makes sense to increase that limit.

To find out what the limit on your machine is:

$ cat /proc/sys/dev/raid/speed_limit_max
200000

Setting it to something higher (as root):

# echo 500000 > /proc/sys/dev/raid/speed_limit_max

So what’s a good speed to set? That depends on what you are looking to achieve. If you don’t mind maxing out your hardware platform (CPU / I/O / disks), then set it to something very high, like 2000000. On the other hand, if you want to keep some CPU and I/O resources back from md-raid (like when doing a RAID-1 rebuild on a production machine), you might want to actually lower it a bit.

The three main issues to consider when working out a RAID max speed:

Number of disks: for aggressive syncs I tend to go with 50 - 70 M/sec per disk, so on a 4 disk system the 200000 number is mostly OK, but on an 8 or 12 disk system I’d look to make that much higher. For conservative rates, or when machine resources are required elsewhere as well, 10 - 12 M/sec per disk.
Interface: the interface you use is also going to make a big difference, so consider the implications of using IDE / SATA / SCSI.
CPU: the RAID jobs, especially when run for large disks or over many disks, will be fairly CPU intensive, so work out what sort of speeds work best for the loads you have. Usually this isn’t something one needs to consider unless the machine is already under load or expected to be used during the RAID operation. Over the last few years, AMD CPUs have been able to deliver slightly better throughput than Intel’s, but in the recent past much of that has changed. So don’t just go with what you hear or opinions around the place: test it yourself.

Finally, while speed_limit_max sets the rate md-raid is going to try to reach, there is also speed_limit_min, which is the rate that md-raid will try to maintain as an ‘at least’ limit. I tend to be a bit more conservative about that number, usually aiming for 25 - 30 M/sec per disk for a very aggressive run, or 10 - 15 M/sec for a more toned down run. If you have I/O intensive ops running on the machine you might need to reduce this even further. However, the default of 1M/s for the whole machine, irrespective of disk count, is something I feel is too low for a modern machine.

I find many people are unaware of this small detail; hopefully this post will help.

MapReduce & Netezza

October 30th, 2009

I recently came across this article about how eHarmony uses Netezza and Hadoop MapReduce, together with some very complex algorithms, to match potential couples.

From the article:
“A giant Oracle 10G database spits out a few preliminary candidates immediately after a user signs up, to prime the pump, but the real matching work happens later, after eHarmony’s system scores and matches up answers to hundreds of questions from thousands of users. The process requires just under 1 billion calculations that are processed in a giant batch operation each day. These MapReduce operations execute in parallel on hundreds of computers and are orchestrated using software written to the open-source Hadoop software platform.”

Netezza is a computer hardware/software company, whose primary product is an MPP data warehouse appliance. The greatest distinguishing feature of Netezza technology is its reliance on FPGAs and PowerPC processors.

What is CUDA?

October 30th, 2009

The graphics cards that we use for gaming/visual enhancement have two basic components: a Graphics Processing Unit (GPU) and off-chip DRAM. GPUs are designed for compute intensive jobs, where CPUs are too slow. On the other hand, CPUs are designed for data caching and control flow, where GPUs are a poor fit.

GPUs in general have a highly parallel architecture, and some of NVIDIA’s GPUs have 240 cores per processor (compare this with modern CPUs: 2, 4 or 8 cores). With such a parallel architecture, GPUs provide an excellent computational platform, not only for graphical applications but for any application with significant data parallelism. GPUs are thus not limited to use as a graphics engine; they are a parallel computing architecture capable of performing floating point operations at rates on the order of a teraflop. People have realized the potential of GPUs for highly computational tasks, and have been working on general purpose computation on GPUs (GPGPU) for a long time. However, life before NVIDIA’s Compute Unified Device Architecture (CUDA) was extremely difficult for the programmer, who had to express the computation through a graphics API such as OpenGL. That approach was also slow to learn.

CUDA solved these problems by providing a hardware abstraction that hides the inner details of the GPU and frees the programmer from the burden of learning graphics programming. CUDA is the C language with some extensions for processing on GPUs. The user writes C code, and the compiler splits it into two portions. One portion is delivered to the CPU (because the CPU is best for such tasks), while the other portion, involving extensive calculations, is delivered to the GPU(s), which execute the code in parallel. Because C is a familiar programming language, CUDA has a gentle learning curve, and hence it is becoming a favorite tool for accelerating various applications. NVIDIA’s CUDA SDK is being employed in a plethora of fields, from computational finance to neural networks and fuzzy logic to simulations for nanotechnology.

CUDA has several advantages over traditional general purpose computation on GPUs (GPGPU) using graphics APIs.

· Scattered reads – code can read from arbitrary addresses in memory.

· It is high level, basically an extension to the C language, so it is much quicker to learn than traditional GPGPU.

· Shared memory – CUDA exposes a fast shared memory region (16KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.

· Faster downloads and readbacks to and from the GPU.

· Full support for integer and bitwise operations.

In short, CUDA lets you exploit these tiny supercomputers, i.e. the GPUs that ship with your graphics cards, and lets you accelerate your applications significantly, sometimes by as much as 100 times or even more, depending on how smartly you have exploited the resources of the GPUs.

So why should one use CUDA?

Though GPUs have way more cores than CPUs, that is not the main reason for using GPUs. In fact, the typical clock speed of a GPU core is way less than the CPU clock speeds of today. Secondly, most financial problems are very sequential. However, they are also repetitive, i.e., pricing a single security is sequential but you can price more securities with more cores. The power of GPUs is really in their ability to handle floating point more efficiently and, more importantly, in their SIMD support (single instruction, multiple data). Suppose you have to add two vectors: a CPU will take linear time to execute the add operation because you will have a loop in your code to add each element separately. On the other hand, GPUs support vector add instructions, which can typically add up to 128 elements in constant time.

But all this power comes at a cost.
1) You lose portability. GPU code is very much tied to the vendor and is hardware specific.
2) The programming paradigm is different. Once you are on a GPU, the OS has very little role in resource management. So applications have to manage resources like cores and several types of memory and registers on GPUs themselves, and also make sure that they are not stepping on each other’s resources.
3) The amount of memory on GPU is limited. So your data structures have to be more compact and less fragmented and the application on the CPU will have to move bits and pieces to the GPU and drive the algorithm.
4) Unless you are developing everything from scratch, integrating with existing code is going to be tricky and painful.

For more information see http://www.nvidia.com/objects/cuda_home.html

Nexus 5010 vs Catalyst 4900M

October 30th, 2009

The major difference between Nexus and a normal Catalyst is that the whole Nexus family runs a different OS (NX-OS), which is based on the MDS storage OS line.
See http://www.cisco.com/en/US/products/ps9372/index.html

Another significant difference is that the Nexus is a cut-through switch, whereas the Catalyst is store-and-forward. Nexus also supports vPC, which means that you can have a multi-chassis EtherChannel trunk from a pair of Nexus 5000/7000 distribution switches to any EtherChannel-enabled access switch. This basically doubles the access-to-distribution bandwidth, as you have no links blocked by Spanning Tree.

The Nexus supports FCoE. The Nexus 5010 is a Layer 2-only device, whereas the 4900M can do routing. If you’re trying to shave off microseconds, the 5010 will beat the 4900M in switching latency. On the other hand, the 4900M is modular and well suited for mixed, low-density 1gig/10gig deployments. Nexus STP is RSTP, not PVST+.

Another major difference is the integration of Nexus 2000 Fabric Extenders with Nexus 5000 switches. The Nexus 2000 switches basically act as remote (over 10Gig fiber) “linecards” of the Nexus 5000. This allows deploying top-of-rack switches without the additional management overhead.

See http://www.cisco.com/en/US/products/ps10110/index.html for more information

Personal Update

October 29th, 2009

I know it’s been a while since I’ve blogged anything specific about myself, so here goes a quick update! Over the past week I’ve resigned from Citi and started doing some contract work for Credit Suisse in the low latency space for one of the prop groups, ‘GAT’ (Global Arbitrage Trading). My current project, which I can’t really disclose much about, is to build out a new high speed trading infrastructure. This is a massive undertaking, but once completed I believe this desk will have one of the fastest systems on the street.

More to come soon!   

Latency Arbitrage

October 29th, 2009

At the most basic level, market participants can generate profits exclusively by exploiting their competitive advantage in latency
by engaging in latency arbitrage. Latency arbitrage involves using your speed advantage to profit from market inefficiencies and
price discrepancies, while trading with counterparties that have latent data. While some market participants frown on this type of
trading, others argue that it serves a purpose in forcing markets to be more efficient and transparent.

Even for the majority of applications that run more complex models than pure latency
plays, low latency is a core requirement of those systems.

MiniFiX - http://minifix.zapto.org/

October 25th, 2009

I recently came across this tool written by Björn Ahlqvist. Mini-FIX is a client/server Windows based application that is able to communicate using the FIX protocol with a high degree of freedom and transparency. It is well suited for developing and testing FIX applications, and it is freely available and distributed under the BSD license.
This product isn’t as robust or feature rich as Verifix or Aegis’s Client Sim but it’s free and fast!! :-)

Click here for more information

nVidia GT300’s Fermi architecture unveiled

October 21st, 2009

The Fermi architecture natively supports C [CUDA], C++, DirectCompute, DirectX 11, Fortran, OpenCL, OpenGL 3.1 and OpenGL 3.2. Now, you’ve read that correctly: Fermi comes with support for native execution of C++. For the first time in history, a GPU can run C++ code with no major issues or performance penalties, and when you add Fortran or C to that, it is easy to see that GPGPU-wise, nVidia did a huge job. You can see more information here
http://www.guru3d.com/news/nvidia-gt300s-fermi-architecture-unveiled/

Bloomberg Uses GPUs to Speed Up Bond Pricing

October 20th, 2009

Each night, Bloomberg calculates pricing for 1.3 million hard-to-price asset-backed securities such as collateralized mortgage obligations (including cash flows, key rate duration and such). Since 1996, the market news giant has performed these calculations — single-factor stochastic models based on Monte Carlo simulations — on a farm of Linux servers in its data centers in New York and New Jersey. “These models are ideal for doing things in parallel, and we did parallelize them over traditional x86 Linux computers,” says CTO Shawn Edwards.


http://www.wallstreetandtech.com/it-infrastructure/showArticle.jhtml?articleID=220200055