Striking a Balance Between Moving Fast and Incurring Technical Debt

I had a great time as an intern here at Personal Capital. I worked as a member of Personal Capital’s Data Team, getting my hands dirty in a range of projects, from data validation, to daily monitoring, to expanding/updating our ETL process into Amazon Redshift. One of my biggest takeaways was the paramount importance of not allowing the fast pace of the company to erode the use of best practices. Specifically, I grew to appreciate how important it is to write solid tests of all new code, keep good documentation of all changes to the database, and always keep security in the back of your mind.


Testing


The Personal Capital codebase is primarily written in Java, which lets us write solid unit tests using the Spring framework. We hold ourselves to a high standard, mandating that our unit tests cover a significant portion of all new code. One of the larger tests that I wrote used Mockito to mock large amounts of data against which to test our transformations, allowing me to focus the test tightly while maximizing its robustness. This test operated on most of the tables in our most important Redshift schemas, and I wrote it so that it would be as easy as possible to add new tables to the test or amend it to accommodate changes to our Redshift schemas. Coupled with extensive logging, these tests allow us to pinpoint bugs when they are introduced into the codebase.
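
To illustrate the general pattern (this is not the actual Personal Capital test; the DAO, transformer, and data below are invented stand-ins), a Mockito-based unit test of an ETL-style transformation might look like this:

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class AccountTransformerTest {

    // Hypothetical data-access interface standing in for the real Redshift DAO.
    interface RedshiftDao {
        List<String> fetchRawRows(String table);
    }

    // Hypothetical transformation under test: trims and upper-cases each row.
    static class AccountTransformer {
        private final RedshiftDao dao;
        AccountTransformer(RedshiftDao dao) { this.dao = dao; }
        List<String> transform(String table) {
            List<String> result = new ArrayList<>();
            for (String row : dao.fetchRawRows(table)) {
                result.add(row.trim().toUpperCase());
            }
            return result;
        }
    }

    @Test
    public void transformTrimsAndUpperCasesEachRow() {
        // Mock the data source so only the transformation logic is exercised.
        RedshiftDao dao = mock(RedshiftDao.class);
        when(dao.fetchRawRows("account")).thenReturn(Arrays.asList(" ira ", "401k"));

        List<String> transformed = new AccountTransformer(dao).transform("account");

        assertEquals(Arrays.asList("IRA", "401K"), transformed);
        verify(dao).fetchRawRows("account");  // the mocked source was queried once
    }
}

Because the data source is mocked, the test exercises only the transformation logic and stays fast and deterministic.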


Documentation


During my time here, I was part of an ongoing effort to create a general-purpose data warehouse on Amazon Redshift: a single source containing all of our data, which makes analysis far easier. By consolidating our data on Redshift, queries that previously took hours now run in seconds. As one might expect, throughout this process it’s common for fields to become deprecated and for new fields to be created with ambiguous meanings. Part of my job at Personal Capital was to address this problem through a unified data dictionary that resolves discrepancies in the definitions used by different teams and consolidates the documentation. This is still a work in progress, but it will ultimately make analysis of Personal Capital’s data much more meaningful. In addition, throughout my time here I deprecated and added many fields in existing tables, as well as added new tables, and I made sure to bolster Personal Capital’s documentation by recording my changes thoroughly.


Security


I was really impressed by the security measures put in place at Personal Capital. Shortly after I began working here, I was given a rundown of the security practices used by Personal Capital as well as information about how to avoid introducing vulnerabilities in the code. The engineers here always keep security in mind when writing new code, something that can be easy to forget in the effort to move quickly. In addition, all new code goes through a rigorous review process before release.


Conclusion


I’m very grateful to have been given the opportunity to work at a startup quickly establishing itself in the financial technology sector. While it is a very fast-moving company, I’m glad that the engineers here don’t lose sight of the long-term picture, always meticulously subscribing to best practices that will greatly improve our outlook years down the road.

We are Growing our Engineering Team!

This is your opportunity to be a disruptor in the financial industry while creating a service that serves families and improves lives.

What sets us apart is the way we use both data aggregation and machine learning technologies along with certified financial advisors to model a family’s financial life. No one has ever had the data, the technology and the people to do what we do: combine the power of an abstract computer model with the expertise of certified financial advisors to help families understand their current financial situation and help them optimize it for the next stages of their lives.

The amazing thing about working at Personal Capital is that when I leave the office each evening, I’m pumped with more energy, enthusiasm and optimism than when I came in. And that’s because I get to build a noble service with great people; it can’t get any better than that. Every day I work with the most talented engineers that I have ever worked with, building sophisticated services that empower thousands of American families to take control of their finances.

We are looking for curious engineers. We are looking for thinkers and doers. You need to be smart and build smart products. You need to be ambitious. This is not an easy job. You will need to wear multiple hats, work with many unknowns, travel many unpaved roads to tackle large-scale problems. But it will be your finest work and creation, and an amazing engineering team is here to collaborate with you and support you.

Here are our current open positions:


Evolving End-User Authentication

EV Certificate Display

The adoption of EV Certificates has rendered the login image obsolete.

This week, Personal Capital discontinued the use of the “login image”, as part of an upgrade to our security and authentication processes.   By “login image”, I mean the little personalized picture that is shown to you on our login page, before you enter your password.

Mine was a picture of a starfish.

Several users have asked us about this decision and, beyond the simple assertion that the login image is outmoded, a little more background is offered here.


The founders and technology principals in Personal Capital were responsible for introducing the login image for website authentication, a decade ago. In 2004, Personal Capital’s CEO Bill Harris founded, along with Louie Gasparini (now with Cyberflow Analytics), a company called PassMark Security, which invented and patented the login image concept, and the associated login flow process. Personal Capital’s CTO, Fritz Robbins, and our VP of Engineering, Ehsan Lavassani, led the engineering at PassMark Security and designed and built the login image technology, as well as additional security and authentication capabilities.

Server login images (or phrases, in some implementations) were a response to the spate of phishing scams that were a popular fraud scheme in the early- and mid-2000s.  When phishing, fraudsters create fake websites that impersonate financial institutions, e-commerce sites, and other secure websites.  The fraudsters send spam email containing links to the fake sites, and unsuspecting users click on the links and end up at the fake site. The user then enters their credentials (username/password), thinking they are at the real site. The hacker running the fake site then has the user’s username/password for the real site and, well, you know what happens next. It’s hard to believe that anyone actually falls for those sorts of things, but plenty of people have. (Phishing is still out there, and has gotten a lot more sophisticated (see spear-phishing for example), but that is a whole other topic).

So, the login image/phrase was a response to the very real question of:  “How can I tell that I am at the legitimate website rather than a fraudulent site?”  With login image/phrase, the user would pick/upload a personalized image or phrase at the secure website. And the login flow changed to a two-step flow: the user enters their username, then the secure site displays the personal image/phrase, and then, assured that they are at the legitimate secure site when they recognize the image/phrase, the user enters their password. The use of login image/phrase was a simple and elegant solution to a vexing problem. And when the FFIEC (U.S. banking regulatory agency) mandated stronger authentication standards for U.S. banking sites in 2005, login image quickly became ubiquitous across financial websites, including Bank of America and many others, during the mid-2000s.

From a security perspective, the login image/phrase is a kind of shared secret between the secure site and the user. Not as important a secret as the password, of course, but important nonetheless, and here’s why: if a hacker posing as the real user enters the username at the secure site and the site displays the user’s login image/phrase, the hacker can steal the image/phrase and use it in constructing their fake website. The fake website would then look like the real website (since it would have the image/phrase) and could fool the user into giving up the real prize (the password) at the phishing site. So the issue of “how to protect the security of the login image?” becomes a relevant question.

Device identification is the answer:  If the website is able to recognize the device that is sending a request containing the username, and if the site knows that device has been authorized by the user, then the site can safely show the login image/phrase, and the user feels secure, and enters their password. This is essentially a process of exchanging more information in each step of the authentication conversation, a process of incremental and escalating trust, culminating in the user entering their password and being granted full access to the site.

But the use of device identification to protect the login image is secondary to the real technology advance of this approach: the use of device identification and device forensics as a second factor in authentication. Combining the device identity with the password creates a lightweight form of two-factor authentication, widely recognized as being far superior to single-factor (password only) authentication.

The simplest form of device identification involves placing a web cookie in the user’s browser. Anyone out there not heard of cookies and need an explanation? OK, good, I didn’t think so. Cookies work pretty well for a lot of purposes, but they have a couple of problems when being used for device identification: (1) the user can remove them from the machine; and (2) malware on the user’s machine can steal them.

The technology of device identification quickly evolved, at PassMark and other security companies, to move beyond cookies and to look at inherent characteristics of the web request, the browser, and the device being used. Data such as the IP address, User-Agent header (the browser identity information), other HTTP headers, etc. Not just the raw data elements, but derived data as well, such as geolocation and ISP data from the IP address. And, looking at patterns and changes in the data across multiple requests, including request velocity, characteristic time-of-day login patterns, changes in data elements such as User-Agent string etc.  Some providers started using opt-in plugins or browser extensions to extract deeper intrinsic device characteristics, such as hardware network (MAC) address, operating system information, and other identifiers.
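
As a rough illustration of the idea only (this is not Personal Capital’s implementation, and real device-forensics systems weigh many more signals statistically), a naive fingerprint might simply hash a few inherent attributes of the request:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import javax.servlet.http.HttpServletRequest;

// Naive sketch only: real device identification combines many more signals
// (geolocation, request velocity, header patterns) and scores them statistically.
public class NaiveDeviceFingerprint {

    public static String fingerprint(HttpServletRequest request) throws Exception {
        // Combine a few inherent characteristics of the request, browser, and device.
        String raw = String.join("|",
                request.getRemoteAddr(),                      // IP address
                safe(request.getHeader("User-Agent")),        // browser identity
                safe(request.getHeader("Accept-Language")),
                safe(request.getHeader("Accept-Encoding")));

        // Hash the combined attributes into a stable identifier.
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(raw.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    private static String safe(String header) {
        return header == null ? "" : header;
    }
}

A single hash like this is brittle (it changes whenever the browser updates), which is exactly why production systems layer many signals and tolerate partial matches.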

“Device forensics” evolved as the practice of assembling large numbers of data points about the device and using sophisticated statistical techniques to create device “fingerprints” with a high degree of accuracy. The whole arena of device identification and device forensics is now leveraged in a variety of authentication and fraud-detection services, including at Personal Capital. This is the real value that grew out of the “login image” effort.

But, while the use of device identification and device forensics was flourishing and becoming a more central tool in the realm of website authentication, the need for the login image itself was becoming less compelling.

Starting in the late 2000s, the major SSL Certificate Authorities (such as Verisign) and the major browsers (IE, Firefox, Chrome, Safari) began adopting Extended Validation (EV) certificates. These certificates require a higher level of validation of the certificate owner (i.e. the website operator, such as Personal Capital), so they are more trusted. And, just as important, the browsers adopted a common user interface idiom for EV certificates, which includes displaying the name of the company that owns the certificate (e.g. “Personal Capital Corporation”) in a distinctive color (generally green) in the browser address bar (see picture). The adoption of EV certificates has essentially answered the original question that led to the use of the login image (i.e. “how does the user know they are at the real website?”).

Which brings us to today. Personal Capital has removed the login image from our authentication flow, leaving a simpler and more streamlined login for our users. It is a security truism that, all else being equal, simpler implementations are more secure implementations – fewer attack vectors, fewer states, fewer opportunities for errors. Personal Capital continues to use device identification and device forensics, allowing users to “remember” authorized devices and to de-authorize devices. We also augment device identification with “out of band” authentication, using one-time codes and even voice-response technology to verify a user’s identity when they want to log in from a new or non-authorized device.

I’ll admit that I will miss my little starfish picture when I log in to Personal Capital. But this small loss is offset by my knowledge that we are utilizing best, and current, security practices.

PassMark Security, circa 2005

“Ugly Shirt Fridays” at PassMark Security, circa 2005

Hibernate Second Level Cache Using Ehcache

Introduction

Setting aside all the good and bad things about Hibernate’s second-level cache, Personal Capital has been using the second-level cache for entities and queries over data that rarely changes. We had our issues while adopting it, but once we put those issues behind us, it has performed well for us.

How to enable in Hibernate

There are two types of second-level cache we use. One is the entity cache, where Hibernate caches loaded data objects at the SessionFactory level, crossing user, session, and transaction boundaries. The other is the query cache, where the SQL query is the cache key and the query result set is the cached value. Whenever the same SQL query is submitted through Hibernate, Hibernate will try to use the cached result set instead of dipping into the database.

Setting this caching up globally only requires defining a few Hibernate properties. Since we use JPA, the following properties are defined in JPA’s persistence.xml.

Property                                  Value
hibernate.cache.use_query_cache           true
hibernate.cache.use_second_level_cache    true
hibernate.cache.region.factory_class      org.hibernate.cache.ehcache.EhCacheRegionFactory
hibernate.cache.use_structured_entries    true
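
For reference, a persistence.xml fragment carrying these properties might look like the following sketch (the persistence-unit name is a placeholder, and the rest of the unit definition is omitted):

<!-- Illustrative fragment only; the persistence-unit name is a placeholder. -->
<persistence xmlns="http://java.sun.com/xml/ns/persistence" version="2.0">
  <persistence-unit name="examplePU">
    <properties>
      <property name="hibernate.cache.use_query_cache" value="true"/>
      <property name="hibernate.cache.use_second_level_cache" value="true"/>
      <property name="hibernate.cache.region.factory_class"
                value="org.hibernate.cache.ehcache.EhCacheRegionFactory"/>
      <property name="hibernate.cache.use_structured_entries" value="true"/>
    </properties>
  </persistence-unit>
</persistence>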

How to enable second-level caching for data objects in the application

Hibernate provides the @Cache annotation to enable second-level caching for data objects in the application.

This annotation can be used on an @Entity class to enable the second-level cache at the table level. For example,

@Entity
@Table(name = "account_type", catalog = "sp_schema")
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class AccountTypeImpl extends AbstractTableWithDeleteImpl implements AccountType, Serializable
{
    // ...
}

It can also be used at the @Column property level to enable caching for an individual property. For example,

@Entity
@Table(name = "account", catalog = "sp_schema")
public class AccountImpl extends AbstractTableWithDeleteImpl implements AccountType, Serializable
{
    // ...

    @Column(name = "detail")
    @Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
    public String getAccountDetail()
    {
    }

    // ...
}

How to enable second-level caching for query cache in application

The query cache is different from the data object cache. To enable it, you need to tell the query itself. In JPA, a cache hint is set on the query to enable the query cache.

query.setHint("org.hibernate.cacheable", true);
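
In context, a cacheable JPA query might look like the following sketch (the JPQL and method are illustrative; AccountTypeImpl is the entity from the example above):

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

public class AccountTypeQueries {

    // Illustrative only: loads all account types with the query cache enabled.
    public static List<AccountTypeImpl> loadAll(EntityManager em) {
        TypedQuery<AccountTypeImpl> query =
                em.createQuery("select a from AccountTypeImpl a", AccountTypeImpl.class);
        // Mark the query as cacheable so its result set goes into the query cache.
        query.setHint("org.hibernate.cacheable", true);
        return query.getResultList();
    }
}

Hibernate caches the result set under the query text and its parameters, so subsequent identical queries are served from the cache until the entry expires or the underlying tables are updated.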

How to configure ehcache for data object second-level cache

When Ehcache is used for the second-level cache, Hibernate tries to work with caches named after the fully qualified class names of the domain objects. Only if there is no cache with a matching name for a domain object does it fall back to caching/retrieving that domain object in the defaultCache.

So Ehcache’s defaultCache should be configured carefully, since any @Cache-annotated domain object will be cached there if no cache is configured under that object’s fully qualified name.

All of Ehcache’s cache names can be defined and configured in ehcache.xml, which must be on the application’s classpath. A simple example:

<ehcache updateCheck="false">
    <diskStore path="java.io.tmpdir/hibernateCache" />
    <defaultCache maxElementsInMemory="100" eternal="false"
        timeToIdleSeconds="1" timeToLiveSeconds="1" overflowToDisk="false"
        diskPersistent="false" diskExpiryThreadIntervalSeconds="1"
        memoryStoreEvictionPolicy="LRU" />
    <cache name="com.personalcapital.aggregation.data.impl.AccountTypeImpl"
        maxEntriesLocalHeap="1000" eternal="false" timeToIdleSeconds="1200"
        timeToLiveSeconds="1200" overflowToDisk="false"
        diskExpiryThreadIntervalSeconds="60" memoryStoreEvictionPolicy="LRU" />
</ehcache>

How to configure ehcache for second-level query cache

For the second-level query cache, Hibernate uses caches named “StandardQueryCache” and “UpdateTimestampsCache” to cache query result sets. If these caches are not configured in ehcache.xml, Hibernate will fall back to using the defaultCache for query result sets, which may not be what is expected.

A note: the fully qualified StandardQueryCache names are different in Hibernate 3 and 4. In Hibernate 3 the name is “org.hibernate.cache.StandardQueryCache”; in Hibernate 4 it is “org.hibernate.cache.internal.StandardQueryCache”. This was one thing that bit us when we were upgrading to Hibernate 4. The following is an example Hibernate 4 ehcache.xml configuration for the query cache.

“UpdateTimestampsCache” is really important too. Here is an excerpt from the Ehcache online documentation (http://ehcache.org/documentation/user-guide/hibernate#enable-second-level-cache-and-query-cache-settings) describing what “UpdateTimestampsCache” is: “Tracks the timestamps of the most recent updates to particular tables. It is important that the cache timeout of the underlying cache implementation be set to a higher value than the timeouts of any of the query caches. In fact, it is recommended that the underlying cache not be configured for expiry at all.”

<!-- ... -->
<cache name="org.hibernate.cache.internal.StandardQueryCache"
    maxEntriesLocalHeap="2000" eternal="false" timeToIdleSeconds="0"
    timeToLiveSeconds="300" overflowToDisk="false"
    diskExpiryThreadIntervalSeconds="60" memoryStoreEvictionPolicy="LRU" />
<cache name="org.hibernate.cache.spi.UpdateTimestampsCache"
    maxEntriesLocalHeap="5000" eternal="true" />
<!-- ... -->

Native Query and Query Cache

The result set of a native query can be cached in the query cache the same way as the result set of a regular HQL query; you just need to tell the query. In JPA, the same cache hint is set to enable the query cache for a native query.

query.setHint("org.hibernate.cacheable", true);

But there is a drawback to the second-level cache when using native queries. Because the query is not HQL, Hibernate has no idea which data objects are affected by it, so by default it invalidates all cached data objects. This behavior is usually not desired. To avoid this global invalidation, the entity affected by the native query should be indicated before the query executes, so the query only invalidates cached objects of that entity. If nothing cached is affected by the query and you don’t want to invalidate anything, an empty query space should be specified.

query.addSynchronizedQuerySpace("");
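
Putting these pieces together, the sketch below shows one way to scope a native query’s cache behavior from JPA (the SQL and entity are illustrative; unwrapping to Hibernate’s SQLQuery is one way to reach the synchronization API, assuming the Hibernate 3/4 SQLQuery interface):

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.Query;
import org.hibernate.SQLQuery;

public class NativeQueryCacheExample {

    // Illustrative only: the SQL and entity are placeholders.
    public static List<?> loadAccountTypes(EntityManager em) {
        Query query = em.createNativeQuery("select * from account_type");
        // Cache the native query's result set in the query cache.
        query.setHint("org.hibernate.cacheable", true);

        // Reach the Hibernate API to declare which entity (query space) this SQL
        // touches, so cache invalidation and up-to-date checks are scoped to that
        // entity rather than to everything.
        SQLQuery sqlQuery = query.unwrap(SQLQuery.class);
        sqlQuery.addSynchronizedEntityClass(AccountTypeImpl.class);
        // Or, if the query affects nothing that is cached:
        // sqlQuery.addSynchronizedQuerySpace("");

        return query.getResultList();
    }
}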

Some ending words

Again, the Hibernate second-level cache can turn into a monster if it is not applied carefully. The caching strategy should be thought out before it is applied; otherwise, the application may exhibit all kinds of unexpected behavior. What is described above only covers the caching strategy we use: basically a simple time-to-live, timeout-bound cache. This works well for our seed data, which is small and doesn’t change frequently.

Template-Based Approach for SOAP Web Services

Problem statement: 

The problem can be generalized as: maintenance issues with auto-generated stubs and skeletons in web services, particularly when the request needs to be generated dynamically based on account type.

Pershing already exposes an openAccount web service, so we can generate stubs and skeletons and use them to build the request and response. In this module, however, we have to generate requests dynamically based on the account type. To do that we would have to maintain the mapping in two places: at the business level (the field mapping maintained by the business at Personal Capital) and at the database level, so the server code can access it and use it to generate dynamic requests. This can lead to inconsistencies between the mapping the business maintains and the code base the server team maintains. The code base itself is also difficult to maintain, because the openAccount web service has so many properties that must be set based on the account type.
Solution:
To solve this, we created the required templates as String-based templates (in XML format), and the values to populate them are constructed dynamically from the Excel file maintained by the business (so the Excel file maintained by the business is always the source of truth). We construct the request based on the account type and invoke the openAccount web service endpoint URL directly, without using the auto-generated stubs and skeletons, as shown below. This resulted in less code to maintain and left only one place to maintain the mappings, which avoids inconsistencies.
For example:

SOAPConnectionFactory scf = SOAPConnectionFactory.newInstance();
SOAPConnection con = scf.createConnection();

// Create a message factory.
MessageFactory mf = MessageFactory.newInstance();

// Create a message from the message factory.
SOAPMessage soapMsg = mf.createMessage();

// Create objects for the message parts and set the content
// from the populated XML template.
SOAPPart soapPart = soapMsg.getSOAPPart();
StreamSource msgSrc = new StreamSource(new StringReader(inputSoapMsg));
soapPart.setContent(msgSrc);

// Save the message.
soapMsg.saveChanges();
// soapMsg.writeTo(System.out);

// Call the openAccount endpoint directly.
URLEndpoint urlEndpoint = new URLEndpoint(this.getPershingEndPoint());
SOAPMessage reply = con.call(soapMsg, urlEndpoint);

if (reply != null)
{
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    reply.writeTo(out);
    String output = out.toString();
    logger.info("The Response message is:" + output);
}
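
The inputSoapMsg passed into the StreamSource above is produced from one of the String-based templates. A minimal sketch of that substitution step (the ${key} placeholder syntax, element names, and values are hypothetical; in practice the values come from the Excel-driven mapping):

import java.util.HashMap;
import java.util.Map;

public class OpenAccountRequestBuilder {

    // Replace each ${key} placeholder in the XML template with its mapped value.
    public static String populate(String template, Map<String, String> values) {
        String request = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            request = request.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return request;
    }

    public static void main(String[] args) {
        // Hypothetical template and values for illustration only.
        String template =
                "<OpenAccountRequest>"
              + "<accountType>${accountType}</accountType>"
              + "<taxState>${taxState}</taxState>"
              + "</OpenAccountRequest>";

        Map<String, String> values = new HashMap<String, String>();
        values.put("accountType", "IRA");
        values.put("taxState", "CA");

        System.out.println(populate(template, values));
    }
}

Adding or changing an account type then only touches the template and the Excel-driven mapping, not generated stub code.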