Adding Purification Tags to Expression Vectors: His, GST, and SUMO Strategies

April 27, 2026

Adding a purification tag to a protein expression vector is rarely the bottleneck people assume it will be. The actual bottleneck is choosing the right tag for your protein and then putting it in a place that doesn’t silently sabotage expression, folding, or activity. Most online guides describe how to clone in a 6×His tag and stop there. This post covers the three tags you actually choose between in a working lab — His, GST, and SUMO — and how to design each one without breaking your construct.

This post is written for someone who has done at least a few rounds of cloning and now needs to add a tag to a real construct that will be used for protein purification, pull-downs, or downstream assays.

How to choose between His, GST, and SUMO

The three tags solve different problems, and putting the wrong one on your protein can mean weeks of confused troubleshooting. Choose by what you need from the tag, not by what is already in the lab’s plasmid drawer.

Tag	Size	Mechanism	Best for	Avoid when
6×His (or 8×/10×His)	~0.8–1.4 kDa	Binds Ni-NTA or Co-NTA via histidine coordination	Routine purification; small tag has minimal impact on folding/activity	Protein has surface-exposed histidine clusters that bind Ni nonspecifically
GST (glutathione S-transferase)	~26 kDa	Binds glutathione resin; elutes with reduced glutathione	Insoluble proteins (acts as solubility tag); pull-down assays	Target is cytotoxic at high expression; GST dimerizes, can force artificial dimers in your protein
SUMO	~12 kDa	Cleaved by SUMO protease (Ulp1/SENP) at the SUMO/target junction, leaving native N-terminus	Proteins that need a perfectly native N-terminal sequence; difficult-to-fold proteins (SUMO acts as folding chaperone)	You need C-terminal tagging — SUMO only works N-terminally

A common pattern in E. coli expression: try a 6×His tag first. If the protein is insoluble or expression is poor, switch to a His-SUMO fusion — this gives you Ni-NTA capture plus SUMO’s solubility benefits, and the SUMO protease leaves a native N-terminus after cleavage. GST is a third-line choice for proteins that fail both, accepting that the 26 kDa fusion partner introduces its own complications.

Tip The Addgene protocol library documents that many bacterial expression projects switch tags at least once. Designing the construct so the tag can be swapped (flanking restriction sites or Gibson overhangs) saves time when the first attempt fails.

Where to place the tag: N-terminal vs C-terminal

Tag position matters more than tag identity for many proteins. The rules:

N-terminal tagging is the default. The ribosome translates from N to C, so an N-terminal tag is in place before the rest of the protein folds. Protease cleavage sites (TEV, PreScission) and SUMO fusions work cleanly here, because the tag can be removed with little or no residual sequence left on the protein.
C-terminal tagging is required when the N-terminus is functionally important — a signal peptide that must be cleaved, a membrane anchor, or a protein where the native N-terminal residue is required for activity. C-terminal tags are less reliable for solubility rescue.
Don’t put a tag on both ends unless you have a specific reason — you double the chance of folding interference, and dual-tag (tandem affinity) purification requires careful elution strategy.

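For a protein with an unknown structure, the safest choice is an N-terminal His-SUMO fusion joined directly to the protein. You purify on Ni-NTA, then cleave with SUMO protease to get a native N-terminus. Be careful about adding a TEV site between SUMO and the protein as a fallback: SUMO protease cuts immediately after SUMO’s C-terminal Gly-Gly, so the TEV site would stay attached to your protein and the N-terminus would no longer be native. If you would rather cleave with TEV and can accept a single residue left behind, build that as its own construct.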

Designing the cassette: linkers, cleavage sites, and reading frame

This is where most cloning errors happen. The cassette must be in-frame with the start codon, in-frame through the entire fusion, and have correct stop placement.

A typical N-terminal 6×His + TEV cleavage cassette looks like:

ATG CAT CAC CAT CAC CAT CAC GGA TCC GAA AAC CTG TAT TTT CAG GGA [protein ORF] TAA
 M   H   H   H   H   H   H   G   S   E   N   L   Y   F   Q  /G  [protein]   *

Read this carefully:

Six histidines (CAT/CAC codons — both encode His and are favored in E. coli)
A short linker (GS) for flexibility
The TEV protease site ENLYFQ/G — TEV cleaves between Q and G, leaving a single G on your protein’s N-terminus
Your protein’s ORF without its own start codon (the ATG belongs to the tag)
A single stop codon (TAA is most common in E. coli)
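
The frame rules above can be checked mechanically before ordering anything. The following is a minimal sketch, not a validated tool: it builds the standard genetic code, translates the cassette, and flags a missing start, a missing stop, or a premature stop. The ORF is a hypothetical four-codon placeholder, and the function names are made up for illustration.

```python
from itertools import product

# Standard genetic code, listed in the canonical T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 1; spaces are ignored."""
    dna = "".join(dna.split()).upper()
    if len(dna) % 3 != 0:
        raise ValueError("length is not a multiple of 3: fusion is out of frame")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def check_fusion(dna):
    """Return (protein, problems) for a complete start-to-stop fusion."""
    protein = translate(dna)
    problems = []
    if not protein.startswith("M"):
        problems.append("no start codon")
    if not protein.endswith("*"):
        problems.append("no terminal stop codon")
    if "*" in protein[:-1]:
        problems.append("premature stop codon")
    return protein, problems


# 6xHis + GS linker + TEV site, exactly as in the cassette above.
TAG = "ATG CAT CAC CAT CAC CAT CAC GGA TCC GAA AAC CTG TAT TTT CAG GGA"
# Hypothetical placeholder ORF with its native ATG already removed.
ORF = "AGC AAA GGC GAA"

protein, problems = check_fusion(TAG + ORF + "TAA")
print(protein)   # MHHHHHHGSENLYFQGSKGE*
print(problems)  # []
```

A single inserted or deleted base anywhere in the fusion either changes the length so that `translate` raises, or shifts the frame so that the stop check fails, which is the kind of error that is far cheaper to catch here than on a gel.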

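Common Mistake Including the protein’s native ATG in the fusion. If you copy-paste the gene sequence into your construct, you end up with two methionines: M-H-H-H-H-H-H-...-M-protein. Translation still produces the full fusion (there is no internal stop), but the extra Met sits between the cleavage site and your protein, so it stays on the protein after cleavage. An internal in-frame ATG can also act as a secondary start site if the preceding sequence happens to resemble a ribosome binding site, which gives you a background of untagged protein. Always remove the native start codon when fusing to an N-terminal tag.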

For C-terminal tags, the inverse rule applies: do not include the protein’s stop codon. If you do, the ribosome stops before reaching the tag, and you get untagged protein. Many cloning failures trace to this single oversight.
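
Both rules (drop the native ATG for an N-terminal tag, drop the native stop for a C-terminal tag) are easy to encode so that a copy-pasted gene sequence cannot slip through. This is a sketch under simple assumptions: sequences are plain coding-strand DNA, the gene starts with ATG and ends with one stop codon, and all function names and the example gene are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def clean(seq):
    """Remove whitespace and upper-case a DNA string."""
    return "".join(seq.split()).upper()


def strip_start(orf):
    """Drop a leading ATG if present."""
    orf = clean(orf)
    return orf[3:] if orf.startswith("ATG") else orf


def strip_stop(orf):
    """Drop a trailing stop codon if present."""
    orf = clean(orf)
    return orf[:-3] if orf[-3:] in STOP_CODONS else orf


def n_terminal_fusion(tag_with_atg, gene, stop="TAA"):
    """Tag supplies the ATG; the gene loses its own start and stop."""
    return clean(tag_with_atg) + strip_stop(strip_start(gene)) + stop


def c_terminal_fusion(gene, tag, stop="TAA"):
    """Gene keeps its ATG but loses its stop; the stop goes after the tag."""
    gene = clean(gene)
    if not gene.startswith("ATG"):
        raise ValueError("gene must start with ATG for a C-terminal fusion")
    return strip_stop(gene) + clean(tag) + stop


# Hypothetical gene as it would be copied from a database entry.
GENE = "ATG AGC AAA GGC GAA TAA"
N_TAG = "ATG CAT CAC CAT CAC CAT CAC GGA TCC GAA AAC CTG TAT TTT CAG GGA"
C_TAG = "CAT CAC CAT CAC CAT CAC"

n_fusion = n_terminal_fusion(N_TAG, GENE)
c_fusion = c_terminal_fusion(GENE, C_TAG)
print(n_fusion)
print(c_fusion)
```

Note that `strip_start` removes any leading ATG without asking; if the first methionine of your protein matters functionally, that is a design decision to make by hand, not something to leave to a helper.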

Codon usage and host compatibility

The tag itself rarely needs codon optimization: standard His and SUMO sequences in commercial vectors (pET, pGEX, pET-SUMO) are already tuned for E. coli. The protein you’re fusing is where the codon problem lives. If your protein has a low codon adaptation index for E. coli, translation gets through the tag without trouble and then slows or stalls in the downstream protein, so the yield of full-length fusion drops. Run the full fusion through a codon optimizer before ordering synthesis, and treat the tag-protein junction as one continuous sequence, not two pieces glued together. The same principles apply as for any codon optimization for E. coli expression.

One subtle case: GST has a known rare codon at position 17 in its native sequence. Most modern pGEX vectors have already corrected this, but if you’re using a custom GST sequence (e.g., from an old academic plasmid), check the codons before assuming it will express. Rare codons in the tag itself cause expression failure that looks like a downstream problem.
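
A quick rare-codon scan is one way to sanity-check both the target ORF and an inherited tag sequence before running a full optimizer. The sketch below is illustrative only: the `RARE_CODONS` set is a commonly cited shortlist for E. coli and should be treated as an assumption, to be replaced with a codon usage table for your actual strain. The example sequence is hypothetical.

```python
# Assumed shortlist of codons often described as rare in E. coli.
# Replace with data from a codon usage table for your expression strain.
RARE_CODONS = frozenset({"AGG", "AGA", "CGA", "CTA", "ATA", "CCC"})


def rare_codon_report(orf, rare=RARE_CODONS):
    """Return ([(codon_number, codon), ...], fraction_rare) for an in-frame ORF.

    Codon numbers are 1-based so they line up with residue numbering.
    """
    seq = "".join(orf.split()).upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("ORF must be non-empty and a multiple of 3 nt long")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    hits = [(n, c) for n, c in enumerate(codons, start=1) if c in rare]
    return hits, len(hits) / len(codons)


# Hypothetical ORF with several rare codons, including a tandem AGA AGA pair.
hits, fraction = rare_codon_report("ATG AGG AAA CTA GGC AGA AGA TAA")
print(hits)      # [(2, 'AGG'), (4, 'CTA'), (6, 'AGA'), (7, 'AGA')]
print(fraction)  # 0.5
```

Runs of adjacent rare codons, like the pair at positions 6 and 7 here, deserve more attention than isolated ones, since they are the ones most often blamed for stalling.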

Cloning strategy: how to actually get the tag in

Three patterns dominate, ranked by how robust they are when something goes wrong:

Start from a tagged backbone. Vectors like pET-28a (His), pGEX-4T (GST), and pET-SUMO already have the tag in place upstream of an MCS. Clone your insert into the MCS in-frame and you’re done. This is the fastest path, and Addgene has hundreds of pre-validated tagged backbones.
PCR-add the tag during amplification. Design forward primers that include the tag sequence as a 5′ overhang. This works for short tags (His, FLAG, HA) but becomes unwieldy past ~30 codons. The 6×His + TEV cassette above is 16 codons (48 nt); with a ~20 nt annealing region the primer comes to roughly 70 nt, which most synthesis vendors can produce as a single oligo.
Gibson or Golden Gate to insert the tag separately. Useful when retrofitting a tag into an existing construct that doesn’t have one. Order the tag as a gBlock with overlaps to your vector linearization sites.
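
For the PCR route, the forward primer is just the tag cassette followed by the first stretch of the ORF, minus its ATG. Here is a minimal sketch of that assembly. The 20 nt annealing length is a rule-of-thumb assumption, the ORF is a hypothetical placeholder, and the function does no melting-temperature or secondary-structure checks, so treat its output as a starting point for a proper primer design tool.

```python
def clean(seq):
    """Remove whitespace and upper-case a DNA string."""
    return "".join(seq.split()).upper()


def forward_primer(tag_overhang, orf, anneal_len=20):
    """Build a 5'-overhang primer: tag cassette + start of the ORF.

    The ORF's native ATG is dropped because the tag supplies the start codon.
    """
    tag = clean(tag_overhang)
    orf = clean(orf)
    if len(tag) % 3 != 0:
        raise ValueError("tag overhang must be a whole number of codons")
    if orf.startswith("ATG"):
        orf = orf[3:]
    if len(orf) < anneal_len:
        raise ValueError("ORF is shorter than the requested annealing region")
    return tag + orf[:anneal_len]


TAG = "ATG CAT CAC CAT CAC CAT CAC GGA TCC GAA AAC CTG TAT TTT CAG GGA"
# Hypothetical ORF, as copied with its native start codon.
ORF = "ATG AGC AAA GGC GAA GAA CTG TTT ACC GGC GTG"

primer = forward_primer(TAG, ORF)
print(primer)
print(len(primer))  # 68
```

In a real construct the primer also needs whatever sits upstream of the ATG (a restriction site or an assembly overlap), which adds to the length and is left out here.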

For verification after cloning, a diagnostic digest with an enzyme that cuts inside the tag region confirms that the tag is present; pairing it with a second enzyme that cuts asymmetrically in the backbone or insert also confirms orientation. Check buffer compatibility for the double digest before going to the bench. Tagged constructs often carry a unique site in or next to the tag (e.g., the BamHI site encoded by the GGA TCC linker in the cassette above, or NdeI at the start codon in many pET vectors), and that site is your fastest verification handle.
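
Finding a candidate diagnostic site can be done with a plain string search before opening a plasmid editor. The sketch below assumes a linear sequence and palindromic recognition sites (so scanning one strand is enough); it does not handle sites that span the origin of a circular plasmid. The two recognition sequences are the standard ones for BamHI and NdeI, and the construct is the cassette from earlier plus a hypothetical placeholder ORF.

```python
# Recognition sequences for two common enzymes (both palindromic).
SITES = {"BamHI": "GGATCC", "NdeI": "CATATG"}


def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    seq = "".join(seq.split()).upper()
    site = site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # step by 1 to catch overlaps
    return positions


def unique_cutters(seq, sites=SITES):
    """Names of enzymes whose site occurs exactly once in the sequence."""
    return [name for name, site in sites.items() if len(find_sites(seq, site)) == 1]


CONSTRUCT = (
    "ATG CAT CAC CAT CAC CAT CAC GGA TCC GAA AAC CTG TAT TTT CAG GGA "  # tag cassette
    "AGC AAA GGC GAA TAA"  # hypothetical placeholder ORF + stop
)

print(find_sites(CONSTRUCT, SITES["BamHI"]))  # [21]
print(find_sites(CONSTRUCT, SITES["NdeI"]))   # []
print(unique_cutters(CONSTRUCT))              # ['BamHI']
```

Run the same search on the full plasmid sequence, not just the cassette: a site that is unique in the insert is only useful as a diagnostic if the backbone does not also carry it.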

What can go wrong, and how to spot it early

Most tag-related failures show up in one of three ways:

Protein expresses but doesn’t bind the resin. Reading frame is off, or the tag is buried in a folding pocket. Sequence-verify the construct first — it’s almost always a frameshift at the tag-protein junction. If the sequence is correct, try the other terminus.
Protein binds resin but elutes prematurely or never elutes. For His tags, check the imidazole concentration in the binding and wash buffers: too much washes the target off early, and too little lets histidine-rich host proteins compete for the resin. For GST, check pH and reductant in the elution buffer. For SUMO fusions released by on-column cleavage, check that the protease matches the SUMO (Ulp1 for yeast SUMO, SENP for human SUMO; they are not interchangeable).
Cleavage doesn’t go to completion. The cleavage site is partially buried. SUMO and TEV proteases need accessible sites; if your protein’s N-terminus folds back over the cleavage site, you get partial cleavage. The fix is usually to add a longer linker (3–5 residues) between the cleavage site and the protein.

None of these are catastrophic, but each one costs a week. Building the construct with the right tag, in the right place, with the right linker, is the cheapest insurance against all of them.

Quick-reference design checklist

Tag chosen based on goal (purification, solubility, native N-terminus) — not lab convenience
Position chosen based on protein biology (N-terminal default; C-terminal only when needed)
Reading frame verified through the entire fusion — tag, linker, cleavage site, protein
Native start codon removed (N-terminal tag) or stop codon removed (C-terminal tag)
Codon optimization applied to the full fusion, not just the protein
Cleavage site choice matches available proteases in the lab (TEV, SUMO, PreScission)
Diagnostic restriction site identified for post-clone verification
For complex fusions, ordered as a single gBlock from a synthesis vendor — not assembled from primers

If you’re designing a new bacterial expression construct from scratch, the six decisions framework for bacterial expression covers the upstream choices that determine whether your tag will even get translated. The tag is the last design decision, not the first.
